Abstract

A quasi-metric is a distance function which satisfies the triangle inequality but is 
not symmetric: it can be thought of as an asymmetric metric. Quasi-metrics were
first introduced in the 1930s and are a subject of intensive research in the context of
topology and theoretical computer science.

The central result of this thesis, developed in Chapter 3, is that a natural corre- 
spondence exists between similarity measures between biological (nucleotide or 
protein) sequences and quasi-metrics. As sequence similarity search is one of the 
most important techniques of modern bioinformatics, this motivates a new direc- 
tion of research: development of geometric aspects of the theory of quasi-metric 
spaces and its applications to similarity search in general and large protein datasets 
in particular. 

The thesis starts by presenting basic concepts of the theory of quasi-metric 
spaces illustrated by numerous examples, some previously known, some novel. In 
particular, the universal countable rational quasi-metric space and its bicomple- 
tion, the universal bicomplete separable quasi-metric space are constructed. Sets 
of biological sequences with some commonly used similarity measures provide a 
further and the most important example. 

Chapter 4 is dedicated to the development of the notion of a quasi-metric space
with a Borel probability measure, or pq-space. The concept of a pq-space is a
generalisation of the notion of an mm-space from asymptotic geometric analysis:
an mm-space is a metric space with a Borel measure that provides the framework
for the study of the phenomenon of concentration of measure on high dimensional
structures. While some concepts and results are direct extensions of results about
mm-spaces, some are intrinsic to the quasi-metric case. One of the main results
of this chapter indicates that 'a high dimensional quasi-metric space is close to
being a metric space'.



Chapter 5 investigates the geometric aspects of the theory of database similar- 
ity search. It extends the existing concepts of a workload and an indexing scheme 
in order to cover more general cases and introduces the concept of a quasi-metric 
tree as an analogue to a metric tree, a popular class of access methods for metric 
datasets. The results about pq-spaces are used to produce some new theoretical
bounds on the performance of indexing schemes.

Finally, the thesis presents some biological applications. Chapter 6 introduces 
FSIndex, an indexing scheme that significantly accelerates similarity searches of 
short protein fragment datasets. The performance of FSIndex turns out to be 
very good in comparison with existing access methods. Chapter 7 presents the 
prototype of the system for discovery of short functional protein motifs called 
PFMFind, which relies on FSIndex for similarity searches. 
Chapter 1



Introduction 



The main focus of this thesis is on the application of concepts of modern mathematics
not previously used in a biological context to problems of biological sequence
similarity search, as well as to the general theory of indexability of databases for fast
similarity search. The biological applications concentrate on investigations
of short protein fragments using a novel tool, called FSIndex, which allows very
fast retrieval of similarity-based queries over datasets of short protein fragments.

Clearly, this work stands at an intersection of several disciplines. The approach
is mostly mathematical and rigorous where possible but also touches on some aspects
of database theory and computational biology. The main result, presented in
Chapter 3, shows that deep connections exist between quasi-metrics (asymmetric
distance functions) and similarity measures on biological sequences. This
motivates an effort to generalise the concepts and techniques from asymptotic
geometric analysis and database indexing that apply to metric spaces to their
quasi-metric counterparts, and to apply the resulting structures to biological questions.

The present chapter introduces the biological background associated with pro- 
teins and their short fragments and outlines the remainder of the thesis. It is as- 
sumed that general concepts related to biological macromolecules are well known 
and only those particularly relevant will be emphasised. Many important con- 
cepts will only be mentioned briefly and their detailed explanation left for the 
subsequent chapters. 



1.1 Proteins 
1.1.1 Basic concepts 

Proteins are organic macromolecules consisting of amino acids joined by peptide
bonds, essential for the functioning of a living cell. They are involved in all major
cellular processes, playing a variety of roles, such as catalytic (enzymes), structural,
signalling, transport etc.

Structurally, proteins are linear chains (polypeptides) composed of the twenty
standard amino acids, which can be classified according to their chemical
properties (Table 1.1). A protein in the living cell is produced through the processes of
transcription and translation. Simply stated, the information encoded by a gene
on DNA is transcribed into an mRNA molecule, which is then translated into a
protein on ribosomes by adding one amino acid for every codon (triplet of nucleotides)
on the mRNA. Constituent amino acids of a protein can be post-translationally
modified, for example by attaching a sugar or a phosphate group to their side chains.

Four distinct aspects of protein structure are generally recognised. The
primary structure of a protein is the sequence of its constituent amino acids. The
secondary structure refers to the local sub-structures such as the α-helix, β-sheet or
random coil. The tertiary structure is the spatial arrangement of a single
polypeptide chain while the quaternary structure refers to the arrangements of multiple
polypeptides (protein subunits) forming a protein complex. We refer to the tertiary
and quaternary structures as conformations.

Protein function in general is determined by the conformation but it is strongly 
believed that secondary, tertiary and quaternary structure are all determined by the 
amino acid sequence. So far, there has been no solution to the folding problem, 
which is to determine the conformation solely from the amino acid sequence by 
computational means. All presently known structures have been determined either 
experimentally, by using crystallographic or NMR (Nuclear Magnetic Resonance) 
techniques, or by homology modelling from closely related sequences with exper- 
imentally derived structures. 

While the number of possible amino acid sequences is very large, known
proteins take a relatively small number of conformations [142, 95]. There is an
ongoing effort to determine all possible conformations proteins can take, that is, to
produce a map of the conformation space [95, 96, 97]. Such a map would enable
modelling of all the structures which have not been experimentally determined
using the existing structures of similar proteins.

Amino acid      Three-letter   One-letter   Residue     Abundance   Properties
                code           code         mass (Da)   (%)
Glycine         Gly            G             57.0       6.93        no side chain
Alanine         Ala            A             71.1       7.80        non-polar aliphatic
Valine          Val            V             99.1       6.69        non-polar aliphatic
Isoleucine      Ile            I            113.2       5.91        non-polar aliphatic
Leucine         Leu            L            113.2       9.62        non-polar aliphatic
Methionine      Met            M            131.2       2.37        non-polar aliphatic
Phenylalanine   Phe            F            147.2       4.02        non-polar aromatic
Tryptophan      Trp            W            186.2       1.16        non-polar aromatic
Serine          Ser            S             87.1       6.89        polar aliphatic
Threonine       Thr            T            101.1       5.46        polar aliphatic
Asparagine      Asn            N            114.1       4.22        polar aliphatic
Glutamine       Gln            Q            128.1       3.93        polar aliphatic
Tyrosine        Tyr            Y            162.2       3.09        polar aromatic
Lysine          Lys            K            128.2       5.93        charged, basic
Arginine        Arg            R            156.2       5.29        charged, basic
Histidine       His            H            137.1       2.27        charged, basic
Aspartic acid   Asp            D            115.1       5.30        charged, acidic
Glutamic acid   Glu            E            129.1                   charged, acidic
Cysteine        Cys            C            103.1       1.57        forms disulphide bridges
Proline         Pro            P             97.1       4.85        cyclic, disrupts structure

Table 1.1: The standard amino acids. Residue mass is the mass of the amino acid minus the
mass of a molecule of water (18.0 Da). Relative abundances are taken from Release
44.0 of the SwissProt sequence database [23].

A structural motif is a three-dimensional structural element or fold consisting
of consecutive secondary structures, for example, the β-barrel motif. Structural
motifs can but need not be associated with biological function. A structural
domain is a unit of structure having a specific function which combines several
motifs and which can fold independently. A protein sequence motif is an amino acid
pattern associated with a biological function. It may, but need not, be associated
with a structural motif.

1.1.2 Protein sequence alignment 

Sequence alignment is presently one of the cornerstones of computational biology
and bioinformatics [180]. As mentioned before, all elements of protein structure
and function ultimately depend on the sequence. In addition, sequence data is the
most readily available, mostly originating from translations of the sequences of
genes and transcripts obtained through large scale sequencing projects [196, 213]
such as the recently completed Human Genome Project P3l. Raw sequences
produced by the sequencing projects need to be annotated, that is, functional
descriptions need to be attached to each sequence and/or its constituent parts [179].
The most widely used (but not always adequate [166, 69]) technique for annotation
is homology or similarity search, where the unannotated sequences are annotated
according to their similarity to previously annotated sequences [24], resulting in
great savings of the time and effort required for experimental analysis of each sequence.

Much of the sequence data is easily accessible from public repositories [62],
the best known being the database collection at the National Center for
Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) in the
United States [209]. The NCBI repository contains, among many others, the
GenBank [18] DNA sequence database, a part of the international collaboration
involving its European (EMBL) [117] and Japanese (DDBJ) [139] counterparts, and
RefSeq [158], the set of reference gene, transcript and protein sequences for a
variety of organisms. The major source of protein related resources is the ExPASy
site [67] at the Swiss Institute of Bioinformatics (http://www.expasy.org),
the home of SwissProt, a human curated database of annotated protein sequences,
and its companion TrEMBL, a database of machine-annotated translated coding
sequences from EMBL [23]. SwissProt and TrEMBL together form the UniProt
[10] universal protein resource. UniProt has a sequence composition similar to that
of the NCBI RefSeq protein dataset.

The principal technique for general pairwise biological sequence comparison
is known as alignment¹. We distinguish global alignment, where the whole extent
of both sequences is aligned, and local alignment, where only substrings
(contiguous subsequences) are aligned. The foundations of the algorithms for sequence
alignment were developed in the 1970s and early 1980s [146, 171, 203, 178],
culminating in the famous Smith-Waterman [177] algorithm for local sequence
alignments.

Pairwise sequence alignment is based on transformations of one sequence into
the other, which are broken down into transformations of substrings of one sequence
into substrings of the other. Ultimately two types of transformations are used:
substitutions, where one residue (an amino acid in proteins) is substituted for another,
and indels, or insertions and deletions, where a residue or a sequence fragment is
inserted (in one sequence) or deleted (in the other). Indels are often called gaps and
alignments without gaps are called ungapped. Each of the basic transformations is
assigned a numerical score or weight and the transformation with the optimal score is
reported as the 'best' alignment of the two sequences. All algorithms for the
computation of pairwise alignments use the dynamic programming [13] technique.
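
The dynamic programming idea can be sketched for the local (Smith-Waterman) case as follows. This is a minimal Python illustration rather than code from the thesis: the names `local_alignment_score` and `toy_score`, the match and mismatch values, and the linear (per-residue) gap penalty are assumptions chosen for brevity. Real searches would take substitution scores from a score matrix.

```python
def local_alignment_score(a, b, score, gap):
    """Best local alignment score of strings a and b (Smith-Waterman recurrence).

    score(x, y) is the similarity of substituting x by y; gap is the positive
    penalty charged for each inserted or deleted residue (linear gap model).
    """
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for x in a:
        curr = [0]              # column 0: nothing of b has been consumed yet
        for j, y in enumerate(b, start=1):
            cell = max(
                0,                          # start a new local alignment here
                prev[j - 1] + score(x, y),  # substitution (match or mismatch)
                prev[j] - gap,              # x aligned against a gap
                curr[j - 1] - gap,          # y aligned against a gap
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def toy_score(x, y):
    """Hypothetical similarity: reward identities, penalise everything else."""
    return 2 if x == y else -1
```

Only two rows of the table are kept because the sketch returns the score alone; recovering the alignment itself would require storing the whole table and tracing back from the best cell.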

Alignment scores can be distances, in which case all scores are positive and
identity transformations (no changes) have the score 0. Distances are often
required to have additional properties, such as satisfying the triangle inequality.
Alternatively, transformation scores may be given as similarities, which are large
and positive for matches (identity transformations) and some ('close') mismatches,
while other mismatches and gaps have a negative score. The choice of whether to
use similarities or distances is influenced by the available computational algorithms:
similarities are preferred in sequence comparisons because they are more suitable
for local alignments, while distances are often used in phylogenetics [83].
Furthermore, similarity scores are, at least in some cases, amenable to statistical and
information-theoretic interpretations [105, 5, 104].

¹ The term 'alignment' is used to denote both the method of sequence comparison and a
particular transformation of one sequence into another.

According to the 'basic' alignment model, the transformation scores only de- 
pend on the residues being substituted in the case of substitutions, and lengths 
of the gaps in the case of indels. There is no dependence on the position of the 
transformation within the two sequences being compared nor on the previous or 
subsequent transformations. In this model, substitution scores come from score
matrices, the best known being the PAM fllSl and BLOSUM [88] families of
amino acid matrices. Both PAM and BLOSUM matrices were derived from
multiple alignments (alignments of more than two sequences) of related proteins.

The most widely used tool for sequence similarity search is BLAST (Basic
Local Alignment Search Tool), developed at the NCBI. BLAST is based on a
heuristic search algorithm which uses dynamic programming on only a relatively
small part of the sequence database searched while retrieving most of the hits or
neighbours. The importance of BLAST cannot be overestimated - its applications
range from day-to-day use by biologists to find sequences similar to the sequences
of interest to them, to high throughput automated annotation, sequence clustering and
many others. Finding efficient algorithms which would improve on BLAST in
accuracy and/or speed remains an area of very active development [108]
iToiiniiiii.

While BLAST is quite fast and accurate, it cannot always retrieve all
biologically significant homologues due to limitations of the basic alignment model.
Improvements to the basic alignment model involve the use of Position Specific
Score Matrices or PSSMs, also known as profiles iTTSl, which assign different
substitution scores at different positions. PSI-BLAST [6] uses PSSMs through an
iterative technique where the results of each search are used to compute a PSSM for
a subsequent iteration - the first search is performed using the basic model. This
method is known to retrieve more 'distant' homologues which would be missed
using the basic model. More sophisticated sequence and alignment models such
as Hidden Markov Models (HMMs) [52, 53, 106, 85] can be used with even more
accuracy if there is sufficient data for their training. In most common cases, a
substantial body of statistical theory for interpretation of the results exists [52, 54].

1.1.3 Short peptide fragments 

While most of the works relating to protein sequence analysis concentrate on
either full sequences or fragments of medium length (50 amino acids - e.g. [126]),
the main biological focus of this thesis is on short peptide fragments of lengths 6
to 15.

While short peptide fragments can be interesting as parts of larger functional
domains, they often have important physiological functions of their own. To
mention one of many examples, a large variety of peptides are generated in the
gut lumen during normal digestion of dietary proteins and absorbed through the
gut mucosa. Smaller fragments, that is dipeptides and tripeptides, are the primary
source of dietary nitrogen. Larger peptides, many of which have been shown to
have physiological activity, may also be absorbed. These peptides may modulate
neural, endocrine, and immune function [22, 110]. Short peptide motifs may
also have a role in disease. For example, it was discovered that one of the proteins
encoded by HIV-1 and Ebola viruses contains a conserved short peptide motif
which, due to its interaction with host cell proteins involved in protein sorting,
plays a significant role in the progress of the disease [132].

The biological part of this thesis aims to develop tools for identifying con- 
served fragment motifs among possibly otherwise unrelated protein sequences. 
Such tools may produce the results that would enable determination of the origin 
of fragments with no obvious function. The investigation is not restricted solely 
to bioactive peptides but considers all possible fragments (of given lengths) of full 
sequences available from the databases. 

The main paradigm can be expressed as follows: 
    A sequence fragment that recurs in a non-random and unexpected pattern
    indicates a possible structural motif that has a biological function.

The approach taken here mirrors that of full sequence analysis - the principal 
technique used is similarity search using substitution matrices and profiles. How- 
ever, the sequence comparison model uses a global ungapped similarity measure 
comparing the fragments of the same length. This can be justified by computa- 
tional advantages - it leads to sequence comparisons of linear instead of quadratic 
complexity, and also by the specific nature of the problem. 
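
To make the comparison model concrete, the following Python sketch scores two fragments of equal length position by position, once against a substitution matrix and once against a profile. It illustrates the idea only: the three-letter alphabet, the values in `TOY_SCORES` and the function names are hypothetical and do not come from the thesis or from any published matrix.

```python
def ungapped_similarity(x, y, score):
    """Global ungapped similarity of two equal-length fragments (one pass, O(n))."""
    if len(x) != len(y):
        raise ValueError("fragments must have the same length")
    return sum(score[a][b] for a, b in zip(x, y))


def profile_similarity(profile, y):
    """Score fragment y against a profile: one table of scores per position."""
    if len(profile) != len(y):
        raise ValueError("profile and fragment must have the same length")
    return sum(column[b] for column, b in zip(profile, y))


# Hypothetical three-letter alphabet and scores, for illustration only.
TOY_SCORES = {
    "A": {"A": 4, "B": 1, "C": -1},
    "B": {"A": 1, "B": 6, "C": 0},
    "C": {"A": -1, "B": 0, "C": 5},
}

# The simplest profile for a query copies the matrix row of each query residue,
# so scoring against it agrees with scoring against the matrix directly.
query = "ABC"
profile = [TOY_SCORES[a] for a in query]
```

Because no gaps are allowed, each position is visited exactly once, which is where the linear cost comes from; a gapped comparison would need the full dynamic programming table.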

One issue which is not so problematic with longer sequences is that of
statistical significance. According to the model of Karlin and Altschul [105] used
(in a slightly modified form) in BLAST, short alignments are not statistically
significant at the levels routinely used for full sequence analysis - there are too few
possible alignments between two short fragments. In other words, high
scoring alignments of two short fragments are not unlikely to occur by chance and
hence the results of searches cannot be immediately assumed to have biological
significance. The current attempt at overcoming this problem is based on
using the iterative approach to refine the sequence profile and on insisting on strong
conservation among the search results.

Reliance on similarity search and the vast scale of existing sequence databases 
puts a premium on fast query retrieval that cannot be obtained using existing tools 
such as BLAST, which, at significance levels necessary to retrieve sufficient num- 
bers of hits, essentially reduces to sequential scan of all fragments. Hence it is 
necessary to first develop an index that would speed up the search and to do so it 
is necessary to explore the geometry of the space of peptide fragments. This leads 
to the other central concepts of the thesis: indexing schemes and quasi-metrics.

1.2 Indexing for Similarity Search 

Indexing a dataset means imposing a structure on it which facilitates query re- 
trieval. Most common uses of databases require indexing for exact queries, where 
all records matching a given key are retrieved. On the other hand, many kinds
of databases, such as multimedia, spatial and indeed biological ones, need to support
query retrieval by similarity - they need to fetch not only the objects that match
the query key exactly but also those that are 'close' according to some
similarity measure. Hence, a substantial amount of research is directed towards efficient
algorithms and data structures for indexing of datasets for similarity search [130].

It is not surprising that geometric as well as purely computational aspects such
as I/O costs are heavily represented in the existing works on indexing for similarity
search. Indeed, most publications concentrate on algorithms and data
structures which can be applied to datasets that can be represented as vector or
metric (distance) spaces [36, 93]. In many cases, the so-called Curse of
Dimensionality [61] is encountered: performance of indexing schemes deteriorates as
the dimension of datasets grows so that at some stage sequential scan outperforms
any indexing scheme [201, 91]. This manifestation has been linked by Pestov [154]
to the phenomenon of concentration of measure on high-dimensional structures,
well known from asymptotic geometric analysis [138, 121].
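
A small numerical experiment gives some intuition for this effect. The sketch below is only an illustration under assumed conditions (points drawn uniformly from the unit cube, Euclidean distance, a fixed random seed); the function name `relative_spread` is hypothetical. It measures how much the distances from a query point to the data points vary relative to their mean.

```python
import math
import random
import statistics


def relative_spread(dim, n_points=500, seed=0):
    """Standard deviation divided by mean of the distances from one random
    query point to n_points random points of the unit cube in dimension dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return statistics.pstdev(distances) / statistics.mean(distances)


low = relative_spread(dim=2)     # distances vary widely around their mean
high = relative_spread(dim=200)  # distances cluster tightly around their mean
```

When the relative spread is small, almost all points lie at nearly the same distance from the query, so a search radius large enough to contain any neighbours tends to contain most of the dataset and an index has little to prune.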

In their influential paper [87], Hellerstein, Koutsoupias and Papadimitriou
stressed the need for a general theory of indexability in order to provide a unified
approach to the great variety of schemes used to index datasets for similarity
search and provided a simple model of an indexing scheme. The aim of this thesis
is to extend their model so that it corresponds more closely to the existing indexing
schemes for similarity search and to apply methods from asymptotic
geometric analysis to performance prediction. Sharing the philosophy espoused in
[180] that theoretical developments and massive amounts of computational work
must proceed in parallel, we apply some of the theoretical concepts to concrete
datasets of short peptide fragments. In that way we both demonstrate important
theoretical and practical techniques and obtain an efficient indexing scheme which
can be used to answer biological questions.
1.3 Quasi-metrics

One of the fundamental concepts of modern mathematics is the notion of a metric
space: a set together with a distance function which separates points (i.e. the
distance between two points is 0 if and only if they are identical), is symmetric
and satisfies the triangle inequality. The theory of metric spaces is very well
developed and provides the foundation of many branches of mathematics such as
geometry, analysis and topology as well as more applied areas. In many practical
applications, it is a great advantage if the distance function is a metric and
this is often achieved by symmetrising or otherwise manipulating other distance
functions.

A quasi-metric is a distance function which satisfies the triangle inequality but
is not symmetric. There are two versions of the separation axiom: either it remains
the same as in the case of a metric, that is, for the distance between two points to be
0 they must be the same, or it is allowed that one of the distances between two
different points be 0, but not both. In all cases the distance between two identical points
has to be 0. Hence, for any pair of points in a quasi-metric space there are two
distances which need not be the same. Quasi-metrics were first introduced in
the 1930s [212] and are a subject of intensive research in the context of topology and
theoretical computer science [118].

While many of the results from the theory of metric spaces transfer directly
to the quasi-metric case, there are some concepts which are unique to
quasi-metrics, the most important being the concept of duality. Every quasi-metric has
its conjugate quasi-metric which is obtained by reversing the order of each pair of 
points before computing the distance. Existence of two quasi-metrics, the original 
one and its conjugate leads to other dual structures depending on which quasi- 
metric is used: balls, neighbourhoods, contractive functions etc. We distinguish 
them by calling the structures obtained using the original quasi-metric the left 
structures while the structures obtained using the conjugate quasi-metric are called 
the right structures. The join or symmetrisation of the left and right structures 
produces a corresponding metric structure. 
Another important concept which has no metric counterpart is that of an
associated partial order. Every quasi-metric space can be associated with a partial
order and every partial order can be shown to arise from a quasi-metric. Hence, 
quasi-metrics are not only generalised metrics, but also generalised partial orders. 
This fact has been important for the theoretical computer science applications and 
also has significance in the context of sequence based biology. 
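
As an illustration, consider finite sets with the distance from A to B taken to be the number of elements of A that are missing from B. The sketch below is not taken from the thesis: the example, the function names, and the convention that x precedes y exactly when the distance from x to y is 0 are assumptions made for the illustration. Under that convention the associated order is set inclusion.

```python
from itertools import product


def set_difference_distance(a, b):
    """d(A, B) = number of elements of A that are not in B."""
    return len(a - b)


def precedes(d, x, y):
    """Assumed associated order: x <= y exactly when d(x, y) == 0."""
    return d(x, y) == 0


points = [frozenset(s) for s in ("", "a", "b", "ab", "abc")]

# Brute-force check of the triangle inequality on every triple of sample sets.
triangle_ok = all(
    set_difference_distance(x, z)
    <= set_difference_distance(x, y) + set_difference_distance(y, z)
    for x, y, z in product(points, repeat=3)
)

# The distance is asymmetric: nothing of {a} is missing from {a, b},
# but one element of {a, b} is missing from {a}.
forward = set_difference_distance(frozenset("a"), frozenset("ab"))
backward = set_difference_distance(frozenset("ab"), frozenset("a"))
```

Here the two directed distances between a pair of sets carry different information, and the pairs at distance 0 in one direction are exactly the pairs related by inclusion.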

While the topological properties of quasi-metric and related structures have
been extensively investigated [118], much less is known about the geometric
aspects. We therefore aim to extend the concepts from asymptotic geometric
analysis to quasi-metric spaces in order to have results analogous to those
involving metric spaces as well as to investigate the phenomena specific to the
asymmetric case. Such results can then be applied to the theory of indexing for similarity
search and its applications to sequence based biology.

1.4 Overview of the Chapters 

Chapter 2 introduces quasi-metric spaces and related concepts. The emphasis is
on the notions used in the subsequent chapters as well as on examples. In the last
section, we construct examples of universal quasi-metric spaces of some classes.
A universal quasi-metric space of a given class contains a copy of every
quasi-metric space of that class and, in addition, satisfies the ultrahomogeneity property.
This notion is a generalisation of the well known concept of a universal metric
space first constructed by Urysohn [191]. While there are no direct applications of
universal quasi-metric spaces in this thesis, our construction serves two purposes:
it provides examples of quasi-metric spaces not previously known and sets the
foundations for possible further research mirroring the investigations [193, 198,
156] relating to the universal metric spaces and their groups of isometries.

Chapter 3 explores in detail the connections between biological sequence
similarities and quasi-metrics. The main result is Theorem 3.5.5, which shows that
local similarity measures on biological sequences can be, under some assumptions
frequently fulfilled in real applications, naturally converted into equivalent
quasi-metrics. While it was long known that global similarities can be converted
to metrics or quasi-metrics, it was believed [178] that no such conversion exists
for the local case, at least with respect to metrics.
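
The global, ungapped case can be illustrated with a short Python sketch. The conversion used here, d(x, y) = s(x, x) - s(x, y), is stated as an assumption for the purpose of illustration, since this overview does not spell out the formula, and the local case covered by the theorem is not reproduced. The three-letter alphabet and the scores in `S` are hypothetical.

```python
from itertools import product

# Hypothetical symmetric similarity scores on a three-letter alphabet.
S = {
    ("A", "A"): 4, ("B", "B"): 6, ("C", "C"): 5,
    ("A", "B"): 1, ("A", "C"): -1, ("B", "C"): 0,
}


def s(a, b):
    """Similarity of letters a and b (symmetric lookup in S)."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]


def d(x, y):
    """Assumed conversion: self-similarity of x minus similarity of x to y,
    summed over the positions of two equal-length fragments."""
    if len(x) != len(y):
        raise ValueError("fragments must have the same length")
    return sum(s(a, a) - s(a, b) for a, b in zip(x, y))


def triangle_holds(points):
    """Brute-force check of d(x, z) <= d(x, y) + d(y, z) on all triples."""
    return all(d(x, z) <= d(x, y) + d(y, z)
               for x, y, z in product(points, repeat=3))


# All fragments of length 2 over the toy alphabet.
fragments = ["".join(p) for p in product("ABC", repeat=2)]
```

The similarity itself is symmetric here, yet the resulting distance is not, because different letters have different self-similarities: with these toy scores the distance from "AA" to "BB" differs from the distance from "BB" to "AA".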

Chapter 4 introduces the central mathematical object of this study: the
quasi-metric space with measure, or pq-space. This is a generalisation of a metric space
with measure, or mm-space, which provides the framework for the study of the
phenomenon of concentration of measure on high dimensional structures. We
extend these concepts to pq-spaces and point out the similarities and differences
to the metric case. In particular we study the interplay between asymmetry and
concentration - Theorem 4.6.2 indicates that 'a high dimensional quasi-metric
space is close to being a metric space'. The results from Chapter 4 as well as an
alternative formulation of the main results from Chapter 3 are published in a paper
to appear in Topology Proceedings [181].

Chapter 5, partially based on the joint preprint with Pestov [157], is dedicated
to applications of the mathematical concepts and results of previous chapters to
indexing for similarity search. We extend, among others, the concepts of workload
and indexing scheme first introduced by Hellerstein, Koutsoupias and
Papadimitriou [87] in order to make them more suitable for analysis of similarity search
and apply them to numerous existing published examples. We only consider
consistent indexing schemes - those that are guaranteed to always retrieve all query
results. Most existing indexing schemes for similarity search can only be applied
to metric workloads and while quasi-metrics are mentioned in the literature (e.g.
in [39]), no general quasi-metric indexing scheme exists. We therefore introduce
the concept of a quasi-metric tree and dedicate a separate section to it. Chapter 5
also contains a proposal for a general framework for analysis of indexing schemes
and an application of the concepts developed in Chapter 4 to the analysis of the
performance of range queries.

Chapter 6, building on a second joint preprint with Pestov [182], examines
some aspects of the geometry of workloads over datasets of short peptide fragments
and introduces FSIndex, an indexing scheme for such workloads. FSIndex is
based on partitioning of the amino acid alphabet and combinatorial generation of
neighbouring fragments. Experimental results provide an illustration of many
concepts from Chapter 5 and show that FSIndex strongly outperforms some
established indexing schemes while not using significantly more space. It also has
the advantage that a single instance of FSIndex can be used for searches using
multiple similarity measures.

Chapter 7 introduces the prototype of the PFMFind method for identifying
potential short motifs within protein sequences, which uses FSIndex to query datasets
of protein fragments. Preliminary experimental evaluations, involving six selected
protein sequences, show that PFMFind is capable of finding highly conserved
and functionally important domains but needs improvement with respect to
fragments having unusual amino acid compositions.

Appendix A presents previously unpublished results on the estimation of the
dimension of datasets that the thesis author obtained as a summer student at the
Australian National University in the summer of 1999/2000. It takes the concept of
distance exponent introduced by Traina et al. [188] and provides it with more rigorous
foundations. Several computational techniques for computing the distance exponent
are proposed and tested on artificially generated datasets. The best performing
method is applied in Chapter 6 to estimate the dimensions of two datasets of short
peptide fragments.
Chapter 2

Quasi-metric Spaces 



In this chapter we introduce the concept of a quasi-metric space with related no- 
tions. A quasi-metric can be thought of as an "asymmetric metric"; indeed by 
removing the symmetry axiom from the definition of metric one obtains a quasi- 
metric. However, we shall adopt a more general definition which has the ad- 
vantage of naturally inducing a partial order. Thus, a notion of a quasi-metric 
generalises both distances and partial orders. 

There is a substantial number of publications about topological and uniform
structures related to quasi-metric spaces - the major review by Künzi [118]
contains 589 references. In contrast, there is a relative scarcity of works on geometric
and analytic aspects, which is partially being addressed by the recent papers on
quasi-normed and biBanach spaces [63, 64, 160, 65, 66]. While most known
applications of quasi-metrics come from theoretical computer science, the aim of
this thesis is to show that there is a fundamental connection to sequence based
biology.

Duality is a very important phenomenon often associated with asymmetric
structures. The topological aspects of duality are investigated in great detail in
the paper by Kopperman [113]. In the case of quasi-metrics, duality is manifested
by having two structures, which we call left and right, associated with notions
generalised from metric spaces. The symmetrisation (or a 'join') of these two
structures corresponds to a metric structure.
The present chapter consists mostly of a review of the literature and of basic
concepts illustrated by examples. Our main new contribution is contained in
Section 2.8, which introduces universal quasi-metric spaces analogous to the
universal metric space first introduced by Urysohn [191].

2.1 Basic Definitions 

Definition 2.1.1. Let $X$ be a set. Consider a mapping $d : X \times X \to \mathbb{R}_+$
and the following axioms for all $x, y, z \in X$:

(i) $d(x, x) = 0$.

(ii) $d(x, z) \le d(x, y) + d(y, z)$.

(iii) $d(x, y) = d(y, x) = 0 \implies x = y$.

(iv) $d(x, y) = d(y, x)$.

The axiom (ii) is known as the triangle inequality, the axiom (iii) is called the
separation axiom and the axiom (iv) is called the symmetry axiom.

A function $d$ satisfying axioms (i), (ii) and (iii) is called a quasi-metric and if
it also satisfies (iv) it is a metric. A pair $(X, d)$, where $X$ is a set and $d$ a
(quasi-) metric, is called a (quasi-) metric space.

For a quasi-metric $d$, its conjugate (or dual) quasi-metric $d^*$ is defined for all
$x, y \in X$ by

$$d^*(x, y) = d(y, x),$$

and its associated metric $d^s$ by

$$d^s(x, y) = \max\{d(x, y), d(y, x)\}.$$

The associated metric is the smallest metric majorising $d$. ▲

A quasi-metric is a metric if and only if it coincides with its conjugate quasi- 
metric. 
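
On a finite set the axioms of Definition 2.1.1 can be checked mechanically. The
following Python sketch is illustrative only: the helper names
(`is_quasi_metric`, `is_symmetric`, `conjugate`, `associated_metric`) and the
sample distance on a few integers are our own assumptions and do not come from
the text.

```python
from itertools import product

def is_quasi_metric(points, d):
    """Check axioms (i)-(iii) of Definition 2.1.1 on a finite set of points."""
    for x in points:
        if d(x, x) != 0:                               # axiom (i)
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):                # axiom (ii), triangle inequality
            return False
    for x, y in product(points, repeat=2):
        if x != y and d(x, y) == 0 and d(y, x) == 0:   # axiom (iii), separation
            return False
    return True

def is_symmetric(points, d):
    """Check axiom (iv), symmetry."""
    return all(d(x, y) == d(y, x) for x, y in product(points, repeat=2))

def conjugate(d):
    """Conjugate quasi-metric d*(x, y) = d(y, x)."""
    return lambda x, y: d(y, x)

def associated_metric(d):
    """Associated metric d^s(x, y) = max{d(x, y), d(y, x)}."""
    return lambda x, y: max(d(x, y), d(y, x))

# Hypothetical sample: an asymmetric distance on a few integers.
points = [0, 1, 2, 5]
d = lambda x, y: max(x - y, 0)
d_s = associated_metric(d)
```

In this sample `d` passes the quasi-metric check but is not symmetric, while
`d_s` is symmetric and agrees with the absolute difference of the two points.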
Remark 2.1.2. A function satisfying axioms (i) and (ii) above but not necessarily
satisfying the separation axiom (axiom (iii)) is called a pseudo-quasi-metric and
if it also satisfies the axiom (iv) it is called a pseudo-metric. We use the generic
term distance to denote any of the pseudo-quasi-metrics.

If a distance is allowed to take values in $\mathbb{R}_+ \cup \{\infty\}$ (the
extended half-reals), it is called an extended distance, with the name qualified
according to the other axioms satisfied (e.g. extended pseudo-quasi-metric).

Another often used symmetrisation of a quasi-metric is the 'sum' metric $d^u$,
where for each $x, y \in X$

$$d^u(x, y) = d(x, y) + d(y, x).$$

We now summarise some standard notation. 

Definition 2.1.3. Let $(X, d)$ be a quasi-metric space, $x \in X$, $A, B \subseteq X$ and
$\varepsilon > 0$. Denote by

• $\operatorname{diam}(A) := \sup\{d(x, y) : x, y \in A\}$, the diameter of the set $A$;

• $\mathfrak{B}^L_\varepsilon(x) := \{y \in X : d(x, y) < \varepsilon\}$, the left open ball of radius $\varepsilon$ centred at $x$;

• $\mathfrak{B}^R_\varepsilon(x) := \{y \in X : d(y, x) < \varepsilon\}$, the right open ball of radius $\varepsilon$ centred at $x$;

• $\mathfrak{B}_\varepsilon(x) := \{y \in X : d^s(x, y) < \varepsilon\}$, the associated metric open ball of radius $\varepsilon$ centred at $x$;

• $d(x, A) := \inf\{d(x, y) : y \in A\}$, the left distance from $x$ to $A$;

• $d(A, x) := \inf\{d(y, x) : y \in A\}$, the right distance from $x$ to $A$;

• $d^s(A, x) := \inf\{d^s(x, y) : y \in A\}$, the associated metric distance from $x$ to $A$;

• $A^L_\varepsilon := \{x \in X : d(A, x) < \varepsilon\}$, the left $\varepsilon$-neighbourhood of $A$;

• $A^R_\varepsilon := \{x \in X : d(x, A) < \varepsilon\}$, the right $\varepsilon$-neighbourhood of $A$;

• $A_\varepsilon := \{x \in X : d^s(A, x) < \varepsilon\}$, the associated metric $\varepsilon$-neighbourhood of $A$;

• $d(A, B) := \inf\{d(x, y) : x \in A, y \in B\}$, the distance between $A$ and $B$.

▲

The left balls, distances and neighbourhoods coincide with the right versions
in the case of metric spaces.

Remark 2.1.4. Our notation in some cases slightly differs from that adopted in the
literature. We use $d^s$ to denote the associated metric (and later the norm
associated to a quasi-norm) in order to avoid any confusion that can arise from the
more usual symbols. Also note that we denote the open balls by $\mathfrak{B}$ while we
shall use $\mathcal{B}$ to denote a Borel $\sigma$-algebra of measurable sets and a third
variant of the letter B to denote the set of blocks of an indexing scheme. The
notation $d^u$ is our own: 'u' is the second letter of the word 'sum' and 's' was
already used.

Remark 2.1.5. We shall often (but not always) use $x \vee y$ to denote $\max\{x, y\}$ and
$x \wedge y$ to denote $\min\{x, y\}$.

The following result generalises the triangle inequality to the distances from 
points to sets. 

Lemma 2.1.6. Let $(X, d)$ be a pseudo-quasi-metric space. Then for all $x, y \in X$
and $A \subseteq X$,

$$d(x, A) \le d(x, y) + d(y, A).$$

Proof. By the triangle inequality, for all $z \in A$, $d(x, z) \le d(x, y) + d(y, z)$.
Taking the infimum over all $z \in A$ on both sides of the inequality produces the
desired result. □
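
Lemma 2.1.6 is easy to test numerically on finite sets, where the infimum becomes
a minimum. The sketch below is only an illustration: the function names and the
sample quasi-metric on integers are assumptions introduced here, not notation
from the text.

```python
def dist_point_to_set(d, x, A):
    """Left distance d(x, A) = min{d(x, y) : y in A} for a finite non-empty A."""
    return min(d(x, y) for y in A)

def dist_set_to_point(d, A, x):
    """Right distance d(A, x) = min{d(y, x) : y in A} for a finite non-empty A."""
    return min(d(y, x) for y in A)

# Hypothetical sample quasi-metric on integers and a two-point set A.
d = lambda x, y: max(x - y, 0)
A = [2, 5]
points = range(0, 8)

# Lemma 2.1.6: d(x, A) <= d(x, y) + d(y, A) for all x, y.
lemma_holds = all(
    dist_point_to_set(d, x, A) <= d(x, y) + dist_point_to_set(d, y, A)
    for x in points
    for y in points
)
```

For this sample the left and right distances from the point 7 to `A` differ,
which shows why both notions are needed in the asymmetric setting.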

Definition 2.1.7. Let $(X, d_X)$ and $(Y, d_Y)$ be two quasi-metric spaces. A map
$\varphi : X \to Y$ is called a (quasi-metric) isometry if $\varphi$ is a bijection and for
all $x, y \in X$,

$$d_Y(\varphi(x), \varphi(y)) = d_X(x, y).$$

▲

Lemma 2.1.8. Let $\varphi : X \to Y$ be an isometry between quasi-metric spaces
$(X, d_X)$ and $(Y, d_Y)$. Then $\varphi$ is also an isometry between the metric spaces
$(X, d^s_X)$ and $(Y, d^s_Y)$. □
2.2 Topologies and quasi-uniformities

Each quasi-metric $d$ naturally induces a topology $\mathcal{T}(d)$ whose base consists
of all open left balls $\mathfrak{B}^L_\varepsilon(x)$ centred at any $x \in X$, of radius
$\varepsilon > 0$. This is indeed a base. Take any $x, y \in X$ and $\varepsilon, \delta > 0$ such
that $\mathfrak{B}^L_\varepsilon(x) \cap \mathfrak{B}^L_\delta(y) \neq \emptyset$. For any
$z \in \mathfrak{B}^L_\varepsilon(x) \cap \mathfrak{B}^L_\delta(y)$ set
$\zeta = \min\{\varepsilon - d(x, z), \delta - d(y, z)\}$ and observe that
$\mathfrak{B}^L_\zeta(z) \subseteq \mathfrak{B}^L_\varepsilon(x) \cap \mathfrak{B}^L_\delta(y)$.

Figure 2.1: Left open balls form a base for a quasi-metric topology.



Thus, a set $U$ is open if for each $x \in U$ there is an $\varepsilon > 0$ such that
$\mathfrak{B}^L_\varepsilon(x) \subseteq U$. The topology $\mathcal{T}(d^*)$ is defined in a similar
way: its base consists of all open right balls $\mathfrak{B}^R_\varepsilon(x)$ of radius
$\varepsilon > 0$. Hence, one can naturally associate a bitopological space
$(X, \mathcal{T}(d), \mathcal{T}(d^*))$ to a quasi-metric space $(X, d)$. The relationships
between quasi-metric and bitopological spaces are well researched [118].

Definition 2.2.1. A topological space $(X, \mathcal{T})$ is quasi-metrisable if there
exists a quasi-metric $d$ such that $\mathcal{T} = \mathcal{T}(d)$. ▲

Remark 2.2.2. Note that for any quasi-metric space $(X, d)$,
$\mathfrak{B}_\varepsilon(x) = \mathfrak{B}^L_\varepsilon(x) \cap \mathfrak{B}^R_\varepsilon(x)$ and hence the base
of the metric topology $\mathcal{T}(d^s)$ consists exactly of intersections of left and
right open balls of the same radius, centred at any point. Therefore,
$\mathcal{T}(d^s)$ is the supremum of $\mathcal{T}(d)$ and $\mathcal{T}(d^*)$:

$$\mathcal{T}(d^s) = \mathcal{T}(d) \vee \mathcal{T}(d^*).$$
Not every topology is induced by a quasi-metric; however, Kopperman [112]
showed that every topology on a space $X$ is generated by a continuity function,
that is, an analogue of a quasi-metric which takes values in a semigroup of a
special kind called a value semigroup. The question of which topologies are
quasi-metrisable (i.e. can be induced from a quasi-metric) has long been open.
We mention the characterisations by Kopperman [114] in terms of bitopological
spaces and by Vitolo [200] (see Corollary 2.5.12) in terms of hyperspaces of
metric spaces.

The topology $\mathcal{T}(d)$ induced by a quasi-metric $d$ clearly satisfies the $T_0$
separation axiom. The induced topology is $T_1$ if and only if $d$ also satisfies
the property $d(x, y) = 0 \implies x = y$ for all $x, y \in X$. Often in the literature,
the $T_0$ quasi-metric is called the pseudo-quasi-metric while the name quasi-metric
is reserved only for the $T_1$ case HTllllSII . The definition presented here is also
widely used [161, 201] and comes mostly from computer science applications where
the association with partial orders justifies consideration of the $T_0$
quasi-metrics. Partial orders also arise naturally in the context of biological
sequences which are the main objects of study of this thesis.

Definition 2.2.3. A partial order on a set $X$ is a binary relation
$\le \; \subseteq X \times X$ which is reflexive, antisymmetric and transitive, that is,

(i) for all $x \in X$, $x \le x$,

(ii) for all $x, y \in X$, $x \le y \wedge y \le x \implies x = y$,

(iii) for all $x, y, z \in X$, $x \le y \wedge y \le z \implies x \le z$.

▲

Definition 2.2.4. Let $(X, d)$ be a quasi-metric space. The associated partial order
$\le_d$ is defined by

$$x \le_d y \iff d(x, y) = 0.$$

▲
It is easy to see that $\le_d$ is indeed a partial order and hence one can associate
a partial order to every quasi-metric. The converse is also true.

Example 2.2.5 ([119]). Let $(X, \le)$ be a partially ordered set and for any
$x, y \in X$, set $d(x, y) = 0$ if $x \le y$ and $d(x, y) = 1$ otherwise. It is clear that
$d$ is a quasi-metric and that $\le_d$ coincides with $\le$. The topology $\mathcal{T}(d)$
induced by $d$ is called the Alexandroff topology. The metric associated to $d$ is
the discrete, that is $\{0, 1\}$-valued, metric (cf. Example 2.2.8 below).
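
The passage from a partial order to a quasi-metric and back can be traced on a
small example. The sketch below uses divisibility on a handful of positive
integers as the partial order; this choice, and the names `order_quasi_metric`
and `divides`, are assumptions made purely for illustration.

```python
def order_quasi_metric(leq):
    """Quasi-metric of Example 2.2.5: d(x, y) = 0 if x <= y, and 1 otherwise."""
    return lambda x, y: 0 if leq(x, y) else 1

# Hypothetical partial order: x <= y iff x divides y.
divides = lambda x, y: y % x == 0
X = [1, 2, 3, 4, 6, 12]
d = order_quasi_metric(divides)

# The associated order x <=_d y iff d(x, y) = 0 recovers divisibility.
recovered = {(x, y) for x in X for y in X if d(x, y) == 0}
original = {(x, y) for x in X for y in X if divides(x, y)}

# Triangle inequality, and discreteness of the associated metric.
triangle_ok = all(d(x, z) <= d(x, y) + d(y, z) for x in X for y in X for z in X)
d_s = lambda x, y: max(d(x, y), d(y, x))
discrete = all(d_s(x, y) == (0 if x == y else 1) for x in X for y in X)
```

Here `d(2, 6)` is 0 while `d(6, 2)` is 1, so the asymmetry of the quasi-metric
records the direction of the order.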

Quasi-metrics also generate the so-called quasi-uniformities, which are
uniformities but for the lack of symmetry [57]. More formally, a quasi-uniformity
$\mathcal{U}$ on a set $X$ is a non-empty collection of subsets of $X \times X$, called
entourages (of the diagonal), satisfying

1. Every subset of $X \times X$ containing a set of $\mathcal{U}$ belongs to $\mathcal{U}$;

2. Every finite intersection of sets of $\mathcal{U}$ belongs to $\mathcal{U}$;

3. Every set in $\mathcal{U}$ contains the diagonal (the set $\{(x, x) \mid x \in X\}$);

4. If $U$ belongs to $\mathcal{U}$, then there exists $V$ in $\mathcal{U}$ such that, whenever
$(x, y), (y, z) \in V$, then $(x, z) \in U$.

Axioms 1 and 2 mean that $\mathcal{U}$ is a filter. Any collection $\mathcal{B}$ of entourages
satisfying 3 and 4 which is a prefilter (that is, for each $A, B \in \mathcal{B}$ there is
a $C \in \mathcal{B}$ with $C \subseteq A \cap B$) generates a quasi-uniformity $\mathcal{U}$, which is
the smallest filter on $X \times X$ containing $\mathcal{B}$. In this case, $\mathcal{B}$ is called
a basis of $\mathcal{U}$.

Definition 2.2.6. A pair of the form $(X, \mathcal{U})$ where $X$ is a set and $\mathcal{U}$ is
a quasi-uniformity on $X$ is called a quasi-uniform space. ▲

Let $(X, \mathcal{U})$ and $(Y, \mathcal{V})$ be quasi-uniform spaces. A function $f : X \to Y$
is called quasi-uniformly continuous iff for each $V \in \mathcal{V}$,
$(f \times f)^{-1}(V) \in \mathcal{U}$, where $(f \times f)(x, y) = (f(x), f(y))$. This exactly
mirrors the notion of a uniformly continuous function between uniform spaces.

Let $(X, d)$ be a quasi-metric space. Denote by $N_r = \{(x, y) \mid d(x, y) < r\}$
the entourage of radius $r > 0$. The quasi-metric quasi-uniformity $\mathcal{U}$ on $X$ has
as a base the set of all entourages of radius $r > 0$, that is,

$$U \in \mathcal{U} \iff \exists r > 0 : N_r \subseteq U.$$

The dual (conjugate) quasi-uniformity $\mathcal{U}^*$ is generated by the entourages
$N^*_r = \{(x, y) \mid d(y, x) < r\}$ and the symmetrisation
$\mathcal{U}^s = \mathcal{U} \vee \mathcal{U}^*$ produces a uniformity. It is easy to see that for
any quasi-metric, the uniformity $\mathcal{U}^s$ is equivalent to the uniformity generated
by the associated metric $d^s$.

We now recall parts of the basic theory of completions of quasi-metric spaces. 
All statements are particular cases of corresponding statements for quasi-uniformities. 

Recall that a sequence $x_1, x_2, \ldots$ of points in a metric space $(X, \rho)$ is Cauchy
if for every $\varepsilon > 0$ there exists $N \in \mathbb{N}$ such that for all $i, j > N$,
$\rho(x_i, x_j) < \varepsilon$. A metric space $(X, \rho)$ is complete if every Cauchy sequence
is convergent in $X$.

Definition 2.2.7. A quasi-metric space $(X, d)$ is called bicomplete if the
associated metric space $(X, d^s)$ is complete. ▲

The theory of bicomplete quasi-uniformities was developed in [44] and [124].
It is well known that every quasi-metric space $(X, d)$ has a unique (up to a
quasi-metric isometry) bicompletion $(\tilde{X}, \tilde{d})$ such that
$(\tilde{X}, \tilde{d})$ is a bicomplete extension of $(X, d)$ in which $(X, d)$ is
$\mathcal{T}(\tilde{d})$-dense. The associated metrics $(\tilde{d})^s$ and $\widetilde{d^s}$
coincide, so $(X, d)$ is also $\mathcal{T}(\tilde{d}^s)$-dense in $\tilde{X}$. Furthermore,
if $D$ is a $\mathcal{T}(d)$-dense subspace of a quasi-metric space $(X, d)$ and
$f : (D, d|_D) \to (Y, \rho)$ is a quasi-uniformly continuous map where $(Y, \rho)$ is a
bicomplete quasi-metric space, then there exists a (unique) quasi-uniformly
continuous extension $\tilde{f} : X \to Y$ of $f$.

Apart from the above definition there exist more restricted notions of
completeness of quasi-metric and quasi-uniform spaces developed by Doitchinov
[49, 51, 50], which we will not use in this work.

We now present some well-known examples of quasi-metric spaces. 

Example 2.2.8. Let $X$ be any set and define $d : X \times X \to \mathbb{R}$ by

$$d(x, y) = \begin{cases} 0, & \text{if } x = y, \\ 1, & \text{if } x \neq y. \end{cases}$$

It can be easily checked that $d$ is a metric; such a metric is called the discrete
metric. The topology induced by $d$ is discrete: every singleton is open.
Next we define the quasi-metrics on $\mathbb{R}$ generating the so-called upper and
lower topology.

Definition 2.2.9. The left quasi-metric $u^L : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is given by

$$u^L(x, y) = \max\{x - y, 0\}.$$

Similarly, define the right quasi-metric $u^R : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ by

$$u^R(x, y) = \max\{y - x, 0\}.$$

▲

It is trivial to show that $u^L$ and $u^R$ are quasi-metrics which are conjugate to
each other. The associated metric $u = \max\{u^L, u^R\}$ is the canonical absolute
value metric on $\mathbb{R}$ given by $u(x, y) = |x - y|$. The base for the left topology
$\mathcal{T}(u^L)$ consists of all sets of the form $(a, \infty)$ and the base for the right
topology $\mathcal{T}(u^R)$ of all sets of the form $(-\infty, a)$, where $a \in \mathbb{R}$. Hence
$\mathcal{T}(u^L)$ and $\mathcal{T}(u^R)$ are $T_0$ but not $T_1$ separated. The partial order
associated with $u^L$ (in this case a linear order) is the usual order on reals,
while $u^R$ induces the reverse order.
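
The relations between the left and right quasi-metrics on the reals can be
checked on a few sample values. The following sketch is a numerical illustration
under our own naming (`u_left`, `u_right`, `u`); the sample points are arbitrary.

```python
def u_left(x, y):
    """Left quasi-metric on the reals: max{x - y, 0}."""
    return max(x - y, 0.0)

def u_right(x, y):
    """Right quasi-metric on the reals: max{y - x, 0}."""
    return max(y - x, 0.0)

def u(x, y):
    """Associated metric: the maximum of the left and right quasi-metrics."""
    return max(u_left(x, y), u_right(x, y))

samples = [-2.5, -1.0, 0.0, 0.5, 3.0]
pairs = [(x, y) for x in samples for y in samples]

conjugate_ok = all(u_left(x, y) == u_right(y, x) for x, y in pairs)
abs_ok = all(u(x, y) == abs(x - y) for x, y in pairs)
# The order associated with the left quasi-metric is the usual order on reals.
order_ok = all((u_left(x, y) == 0) == (x <= y) for x, y in pairs)
# The right quasi-metric induces the reverse order.
reverse_ok = all((u_right(x, y) == 0) == (x >= y) for x, y in pairs)
```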

For any topological space $(X, \mathcal{T})$, a continuous function
$(X, \mathcal{T}) \to (\mathbb{R}, u^L)$ is often called lower semicontinuous and a continuous
function $(X, \mathcal{T}) \to (\mathbb{R}, u^R)$ is upper semicontinuous. In accordance with
this terminology, $\mathcal{T}(u^L)$ is often called the topology of lower semicontinuity
on reals while $\mathcal{T}(u^R)$ is called the topology of upper semicontinuity.

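Remark 2.2.10. It is worth noting that for any quasi-metric space $(X, d)$, the
quasi-metric $d$, taken as a function $X \times X \to \mathbb{R}$, is upper semicontinuous with
respect to the product topology $\mathcal{T}(d^*) \times \mathcal{T}(d)$ and lower semicontinuous
with respect to the product topology $\mathcal{T}(d) \times \mathcal{T}(d^*)$. Indeed, let
$U = \{(x, y) : d(x, y) < \delta\}$ and let $V = \{(x, y) : d(x, y) > \delta\}$. One can show
using the triangle inequality that

$$(x, y) \in U \implies \mathfrak{B}^R_\varepsilon(x) \times \mathfrak{B}^L_\varepsilon(y) \subseteq U \quad \text{for } \varepsilon = (\delta - d(x, y))/2$$

and

$$(x, y) \in V \implies \mathfrak{B}^L_\varepsilon(x) \times \mathfrak{B}^R_\varepsilon(y) \subseteq V \quad \text{for } \varepsilon = (d(x, y) - \delta)/2,$$

and hence $U$ is open in $\mathcal{T}(d^*) \times \mathcal{T}(d)$ and $V$ is open in
$\mathcal{T}(d) \times \mathcal{T}(d^*)$. However, $d$ is not in general lower or upper
semicontinuous with respect to the product topologies $\mathcal{T}(d) \times \mathcal{T}(d)$ or
$\mathcal{T}(d^*) \times \mathcal{T}(d^*)$. For a counterexample, set $d = u^L$ and consider
neighbourhoods of $(0, 0)$.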

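Example 2.2.11 ([119, 47]). Another quasi-metric on $\mathbb{R}$ is given by

$$d(x, y) = \begin{cases} y - x, & \text{if } 0 \le y - x < 1, \\ 1, & \text{otherwise.} \end{cases}$$

In this case $d$ induces a $T_1$ topology $\mathcal{T}$ on $\mathbb{R}$ whose base consists of all
left balls centred at $x \in \mathbb{R}$ of the form $\mathfrak{B}^L_r(x) = [x, x + r)$, where
$0 < r \le 1$ (for any $x \in \mathbb{R}$ and $r > 1$, $\mathfrak{B}^L_r(x) = \mathbb{R}$). The topological
space $(\mathbb{R}, \mathcal{T})$ is called the Sorgenfrey line, a well known object in topology
and a source of many counter-examples. The associated metric $d^s$ is the discrete
metric.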

Any unbounded quasi-metric can be converted to a bounded quasi-metric while
preserving the topology in the following way.

Example 2.2.12. Let $(X, d)$ be an extended quasi-metric space. Then
$\rho : X \times X \to \mathbb{R}_+$ defined by

$$\rho(x, y) = \min\{1, d(x, y)\}$$

is a quasi-metric such that $\mathcal{T}(\rho) = \mathcal{T}(d)$. The proof of the quasi-metric
axioms is trivial and the fact that the topologies coincide follows from the fact
that all open balls of radius not greater than 1 coincide.

Definition 2.2.13. Let $(X, \mathcal{T})$ be a topological space. Denote by

• $\mathcal{P}(X)$, the set of all subsets of $X$;

• $\mathcal{P}_0(X)$, the set of all non-empty subsets of $X$;

• $\mathcal{P}_f(X)$, the set of all finite subsets of $X$;

• $\mathcal{K}(X, \mathcal{T})$, the set of all compact subsets of $X$;

• $\mathcal{K}_0(X, \mathcal{T})$, the set of all non-empty compact subsets of $X$;

• $\mathcal{C}(X, \mathcal{T})$, the set of all closed subsets of $X$;

• $\mathcal{C}_0(X, \mathcal{T})$, the set of all non-empty closed subsets of $X$.

If the topology $\mathcal{T}$ is generated by a quasi-metric $d$ we will often replace
$\mathcal{T}$ in the above expressions by $d$, for example obtaining $\mathcal{K}(X, d)$ for the
set of all compact subsets of $X$.

The set $\mathcal{P}(X)$ (or restrictions as above) with some (topological) structure is
often called a hyperspace. ▲

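Example 2.2.14. Let $X$ be a set and let $\mathcal{P}_f(X)$ be the set of all finite
subsets of $X$. Define $\rho : \mathcal{P}_f(X) \times \mathcal{P}_f(X) \to \mathbb{R}$ by
$\rho(A, B) = |A \setminus B| = |A| - |A \cap B|$.

It is easy to see that $A \subseteq B \iff \rho(A, B) = 0$. The triangle inequality can be
verified by noting that
$A \setminus C = (A \setminus (B \cup C)) \cup ((A \cap B) \setminus C) \subseteq (A \setminus B) \cup (B \setminus C)$
and hence $\rho$ is a quasi-metric with the associated order corresponding to set
inclusion. The symmetrisation
$\rho^u(A, B) = |A \,\triangle\, B| = |A| + |B| - 2|A \cap B|$
produces the well-known symmetric difference metric.

Figure 2.2: Set difference quasi-metric, $\rho(A, B) = |A \setminus B|$.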
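
The set difference quasi-metric is straightforward to experiment with using
Python's built-in sets. The sketch below checks the claims of the example on all
subsets of a four-element set; the function names and the choice of universe are
assumptions made for illustration.

```python
from itertools import combinations

def rho(A, B):
    """Set difference quasi-metric: the number of elements of A not in B."""
    return len(A - B)

def rho_sum(A, B):
    """'Sum' symmetrisation rho(A, B) + rho(B, A)."""
    return rho(A, B) + rho(B, A)

def finite_subsets(universe):
    """All subsets of a finite universe, as frozensets."""
    items = list(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

subsets = finite_subsets({1, 2, 3, 4})

triangle_ok = all(rho(A, C) <= rho(A, B) + rho(B, C)
                  for A in subsets for B in subsets for C in subsets)
# rho(A, B) = 0 exactly when A is a subset of B.
order_ok = all((rho(A, B) == 0) == (A <= B) for A in subsets for B in subsets)
# The sum symmetrisation is the size of the symmetric difference.
sym_ok = all(rho_sum(A, B) == len(A ^ B) for A in subsets for B in subsets)
```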



Example 2.2.15. More generally, let $(X, \Sigma, \mu)$ be a measure space and
$\mathcal{N} = \Sigma_{\mathrm{fin}} / \sim$, the set of equivalence classes of measurable subsets of
finite measure, that is, for any $A, B \in \Sigma$ such that $\mu(A) < \infty$ and
$\mu(B) < \infty$, $A \sim B \iff \mu(A \setminus B) = \mu(B \setminus A) = 0$. Then, by the same argument
as above, the function $\rho : \mathcal{N} \times \mathcal{N} \to \mathbb{R}$, where
$\rho(A, B) = \mu(A \setminus B)$, is a $T_0$ quasi-metric.
Example 2.2.16. Let $(X_i, d_i)$, $i = 1, 2, \ldots, n$ be quasi-metric spaces and suppose
$X = X_1 \times X_2 \times \cdots \times X_n$, that is, for each $x \in X$,
$x = (x_1, x_2, \ldots, x_n)$, $x_i \in X_i$. Define $d : X \times X \to \mathbb{R}$ by

$$d(x, y) = \sum_{i=1}^{n} d_i(x_i, y_i).$$

Then it is easy to show that $(X, d)$ is a quasi-metric space. We will call the
product spaces of this kind the $\ell_1$-type quasi-metric spaces. They will feature
extensively later on.

Example 2.2.17. Let $X$ be an $\ell_1$-type product space as above. The Hamming
metric is the metric obtained by setting each $d_i$ above to be the discrete metric.
In other words,

$$d(x, y) = |\{i : x_i \neq y_i\}|.$$

2.3 Quasi-normed Spaces

Important examples of quasi-metrics are induced by quasi-norms, the asymmetric
versions of norms. The research area of quasi-normed spaces has seen significant
development in recent years both in theory [63, 64, 160, 65, 66] and applications
[161, 164]. We survey here some of the main definitions and examples.

Recall that a semigroup $(X, \star)$ is a set $X$ with a binary operation $\star$ satisfying

1. $\forall x, y \in X$, $x \star y \in X$ (closure),

2. $\forall x, y, z \in X$, $x \star (y \star z) = (x \star y) \star z$ (associativity).

A monoid, or a semigroup with identity, is a semigroup $(X, \star)$ containing a unique
element $e \in X$ (also called a neutral element) such that $\forall x \in X$,
$x \star e = e \star x = x$, and a group $(X, \star)$ is a monoid where each element has an
inverse, that is, $\forall x \in X, \exists x^{-1} \in X : x \star x^{-1} = x^{-1} \star x = e$. A
homomorphism from a semigroup $(X, \star)$ to a semigroup $(Y, *)$ is a map
$\phi : X \to Y$ such that $\forall x, y \in X$, $\phi(x) * \phi(y) = \phi(x \star y)$. An isomorphism is
a homomorphism which is a bijection such that its inverse is also a homomorphism.
Definition 2.3.1. A semilinear (or semivector) space on $\mathbb{R}_+$ is a triple
$(X, +, \cdot)$ such that $(X, +)$ is an Abelian semigroup with neutral element $0 \in X$
and $\cdot$ is a function $\mathbb{R}_+ \times X \to X$ which satisfies for all $x, y \in X$ and
$a, b \in \mathbb{R}_+$:

(i) $a \cdot (b \cdot x) = (ab) \cdot x$,

(ii) $(a + b) \cdot x = (a \cdot x) + (b \cdot x)$,

(iii) $a \cdot (x + y) = (a \cdot x) + (a \cdot y)$, and

(iv) $1 \cdot x = x$.

Whenever an element $x \in X$ admits an inverse it can be shown to be unique and
is denoted $-x$. If we replace in the above definition $\mathbb{R}_+$ with $\mathbb{R}$ and
"semigroup" with "group" we obtain an ordinary vector (or linear) space. ▲

Definition 2.3.2 ([164]). Let $(E, +, \cdot)$ be a linear space over $\mathbb{R}$ where $e$ is
the neutral element of $(E, +)$. A quasi-norm on $E$ is a function
$\|\cdot\| : E \to \mathbb{R}_+$ such that for all $x, y \in E$ and $a \in \mathbb{R}_+$:

(i) $\|x\| = \|{-x}\| = 0 \iff x = e$,

(ii) $\|a \cdot x\| = a \|x\|$, and

(iii) $\|x + y\| \le \|x\| + \|y\|$.

The pair $(E, \|\cdot\|)$ is called a quasi-normed space. ▲

It is easy to verify that the function $\|\cdot\|^s$ defined on $E$ by
$\|x\|^s = \max\{\|x\|, \|{-x}\|\}$ is a norm on $E$.

The quasi-norm $\|\cdot\|$ induces a quasi-metric $d_{\|\cdot\|}$ in a natural way.

Lemma 2.3.3. Let $(E, \|\cdot\|)$ be a quasi-normed space. Then $d_{\|\cdot\|}$ defined for
all $x, y \in E$ by

$$d_{\|\cdot\|}(x, y) = \|y - x\|$$

is a quasi-metric whose conjugate $d^*_{\|\cdot\|}$ is given by
$d^*_{\|\cdot\|}(x, y) = \|x - y\|$.
Proof. Let $x, y, z \in E$. We have $d_{\|\cdot\|}(x, x) = \|x - x\| = \|e\| = 0$. Also, if
$d_{\|\cdot\|}(x, y) = d_{\|\cdot\|}(y, x) = 0$, it follows by the first axiom that
$\|y - x\| = \|x - y\| = 0$ and hence $x - y = e$, that is, $x = y$.
For the triangle inequality we have

$$d_{\|\cdot\|}(x, y) + d_{\|\cdot\|}(y, z) = \|y - x\| + \|z - y\| \ge \|y - x + z - y\| = \|z - x\| = d_{\|\cdot\|}(x, z)$$

as required.

The statement about the conjugate is obvious. □

Definition 2.3.4 ([164]). A quasi-normed space $(E, \|\cdot\|)$ where the induced
quasi-metric is bicomplete is called a biBanach space. ▲

Example 2.3.5. A quasi-norm on $\mathbb{R}$ is given for all $x \in \mathbb{R}$ by
$\|x\| = \max\{x, 0\}$. It is easy to show that $u^R$ (Definition 2.2.9) is induced by
the above quasi-norm, since $\|y - x\| = \max\{y - x, 0\}$.
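
The quasi-norm of this example and the quasi-metric it induces through
Lemma 2.3.3 can be checked numerically. The sketch below is illustrative only:
the names `qnorm`, `induced` and `sym_norm` are our own, and integer samples are
used so that all comparisons are exact.

```python
def qnorm(x):
    """Quasi-norm on the reals from the example: ||x|| = max{x, 0}."""
    return max(x, 0)

def induced(x, y):
    """Quasi-metric induced as in Lemma 2.3.3: d(x, y) = ||y - x||."""
    return qnorm(y - x)

def sym_norm(x):
    """Symmetrised norm max{||x||, ||-x||}."""
    return max(qnorm(x), qnorm(-x))

samples = [-3, -1, 0, 2, 5]
scalars = [0, 1, 2]

homogeneous = all(qnorm(a * x) == a * qnorm(x) for a in scalars for x in samples)
subadditive = all(qnorm(x + y) <= qnorm(x) + qnorm(y)
                  for x in samples for y in samples)
is_abs = all(sym_norm(x) == abs(x) for x in samples)
triangle_ok = all(induced(x, z) <= induced(x, y) + induced(y, z)
                  for x in samples for y in samples for z in samples)
```

Note that `qnorm(-3)` is 0 although -3 is not the neutral element; axiom (i)
only requires that the norms of an element and of its inverse vanish together
exactly at the neutral element.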

Example 2.3.6 ([164]). Let $(E, \|\cdot\|)$ be a quasi-normed space. Define

$$\mathcal{B}^*_E = \Big\{ f : \mathbb{N} \to E \;\Big|\; \sum_{n=1}^{\infty} 2^{-n} \|f(n)\|^s < \infty \Big\}.$$

The set $\mathcal{B}^*_E$ can be made into a linear space using standard addition and scalar
multiplication of functions. Set the quasi-norm for each $f \in \mathcal{B}^*_E$ by

$$\|f\|_{\mathcal{B}^*} = \sum_{n=1}^{\infty} 2^{-n} \|f(n)\|.$$

Then the space $(\mathcal{B}^*_E, \|\cdot\|_{\mathcal{B}^*})$ is a quasi-normed space and is a biBanach
space if $E$ is a biBanach space.

We conclude this section by considering quasi-normed semilinear spaces and 
the dual complexity space. 



Definition 2.3.7 ([164]). A quasi-normed semilinear space is a pair $(F, \|\cdot\|_F)$
such that $F$ is a non-empty subset of a quasi-normed space $(E, \|\cdot\|)$ with the
properties that $(F, +|_F, \cdot|_F)$ is a semilinear space on $\mathbb{R}_+$ and $\|\cdot\|_F$ is the
restriction of the quasi-norm $\|\cdot\|$ to $F$.

The space $(F, \|\cdot\|_F)$ is called a biBanach semilinear space if $(E, \|\cdot\|)$ is a
biBanach space and $F$ is closed in the Banach space $(E, \|\cdot\|^s)$. ▲

The complexity space and its dual have been introduced and extensively studied
in the papers by Schellekens [169] and Romaguera and Schellekens [162, 164]
respectively, in order to study the complexity of programs. The example below
presents the dual complexity space as an example of a quasi-normed semilinear
space.

Example 2.3.8 ([164]). Let $(F, \|\cdot\|_F)$ be a quasi-normed semilinear space where
$F$ is a non-empty subset of a quasi-normed space $(E, \|\cdot\|)$. Let

$$\mathcal{C}^* = \Big\{ f : \mathbb{N} \to F \;\Big|\; \sum_{n=1}^{\infty} 2^{-n} \|f(n)\|^s < \infty \Big\}.$$

It is apparent that $\mathcal{C}^*$ is a semilinear space and that
$\mathcal{C}^* \subseteq \mathcal{B}^*_E$ (Example 2.3.6). Define for each $f \in \mathcal{C}^*$

$$\|f\|_{\mathcal{C}^*} = \sum_{n=1}^{\infty} 2^{-n} \|f(n)\|_F$$

so that $(\mathcal{C}^*, \|\cdot\|_{\mathcal{C}^*})$ becomes a quasi-normed semilinear space. Its
associated quasi-metric space $(\mathcal{C}^*, d_{\|\cdot\|_{\mathcal{C}^*}})$ is called the dual complexity
space.

Section 2.4 will present a further example of a quasi-normed semilinear space.

2.4 Lipschitz Functions 

While quasi-metric spaces have been extensively studied from a topological
point of view, the properties of the non-expansive maps between them, also
called 1-Lipschitz functions, have not received the same attention. The only
widely available reference solely on this topic is the paper by Romaguera and
Sanchis [161]. In this section we will define left- and right-Lipschitz maps,
present a few basic results and examples, as well as survey some of the results by
Romaguera and Sanchis. Lipschitz maps will be extensively used in subsequent
chapters and new structures will be introduced where needed.

Definition 2.4.1. Let $(X, d)$ and $(Y, \rho)$ be quasi-metric spaces. A map
$f : X \to Y$ is called left $K$-Lipschitz if there exists $K \in \mathbb{R}_+$ such that for all
$x, y \in X$

$$\rho(f(x), f(y)) \le K d(x, y).$$

The constant $K$ is called a left Lipschitz constant. Similarly, $f$ is right
$K$-Lipschitz if for all $x, y \in X$

$$\rho(f(y), f(x)) \le K d(x, y).$$

▲

Left-Lipschitz functions are commonly called semi-Lipschitz [161] but we use
the above nomenclature in order to be consistent with the other "one-sided" (left-
or right-) structures we introduced. Indeed, it is easy to note that every left
$K$-Lipschitz map $(X, d) \to (Y, \rho)$ is right $K$-Lipschitz as a mapping
$(X, d^*) \to (Y, \rho)$.

Lemma 2.4.2. Let $(X, d)$ and $(Y, \rho)$ be quasi-metric spaces and let $f : X \to Y$
be a left 1-Lipschitz map. Then $f$ is continuous with respect to the left topologies
on both spaces.

Proof. Take any $y \in Y$, $\varepsilon > 0$ and $x \in f^{-1}(\mathfrak{B}^L_\varepsilon(y))$. We need to show that
there is $\delta > 0$ such that $f^{-1}(\mathfrak{B}^L_\varepsilon(y)) \supseteq \mathfrak{B}^L_\delta(x)$. Pick
$\delta = \varepsilon - \rho(y, f(x))$, which is positive since $f(x) \in \mathfrak{B}^L_\varepsilon(y)$. It follows
that for any $z \in \mathfrak{B}^L_\delta(x)$,

$$\rho(y, f(z)) \le \rho(y, f(x)) + \rho(f(x), f(z)) \le \rho(y, f(x)) + d(x, z) < \rho(y, f(x)) + \delta = \varepsilon.$$

□
2.4.1 Examples

From now on we will concentrate on the maps from a quasi-metric space $(X, d)$ to
$(\mathbb{R}, u^L)$. Recall that the quasi-metric $u^L$ is given by
$u^L(x, y) = \max\{x - y, 0\} = (x - y) \vee 0$. The following is an obvious fact.

Lemma 2.4.3. Let $(X, d)$ be a quasi-metric space and $f : (X, d) \to (\mathbb{R}, u^L)$ a
left $K$-Lipschitz function. Then $g : (X, d) \to (\mathbb{R}, u^L)$, where $g = -f$, is a right
$K$-Lipschitz function. □

Unless stated otherwise, we will consider $u^L$ as the canonical quasi-metric on
$\mathbb{R}$. The main examples of Lipschitz functions are, as in the metric case, distance
functions from points or sets, as well as sums of such functions. For each example
both a left and a right 1-Lipschitz function will be produced but the proofs will
be presented only for the left case since the right case follows by duality.

Lemma 2.4.4. Let $(X, d)$ be a quasi-metric space and $y \in X$. Then the function
$d_y : X \to \mathbb{R}$, where

$$d_y(x) = d(x, y),$$

is left 1-Lipschitz and the function $d^*_y : X \to \mathbb{R}$, where

$$d^*_y(x) = d(y, x),$$

is right 1-Lipschitz.

Proof. Let $x, z \in X$. Then $d_y(x) - d_y(z) = d(x, y) - d(z, y) \le d(x, z)$ by the
triangle inequality. Similarly, $d^*_y(z) - d^*_y(x) = d(y, z) - d(y, x) \le d(x, z)$. □
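
The one-sided Lipschitz conditions can be tested on a finite sample. In the
sketch below the quasi-metric `d` on integers, in which moving up costs twice as
much as moving down, is a hypothetical example of our own, as are the helper
names; the target space is the reals with the left quasi-metric
max{s - t, 0}.

```python
def d(x, y):
    """Hypothetical quasi-metric on numbers: going up costs double."""
    return 2 * (y - x) if y >= x else (x - y)

def u_left(s, t):
    """Left quasi-metric on the reals."""
    return max(s - t, 0)

def is_left_1_lipschitz(f, points):
    """Check u_left(f(x), f(z)) <= d(x, z) for all sampled x, z."""
    return all(u_left(f(x), f(z)) <= d(x, z) for x in points for z in points)

def is_right_1_lipschitz(f, points):
    """Check u_left(f(z), f(x)) <= d(x, z) for all sampled x, z."""
    return all(u_left(f(z), f(x)) <= d(x, z) for x in points for z in points)

points = range(-3, 4)
y0 = 1
d_y = lambda x: d(x, y0)        # distance to the fixed point y0
d_star_y = lambda x: d(y0, x)   # distance from the fixed point y0
```

On this sample `d_y` is left but not right 1-Lipschitz, so the two notions do
not coincide once the symmetry axiom is dropped.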

Lemma 2.4.5. Let $(X, d)$ be a quasi-metric space and $A \subseteq X$. Then
$d_A : X \to \mathbb{R}$, where

$$d_A(x) = d(x, A),$$

is left 1-Lipschitz and $d^*_A : X \to \mathbb{R}$, where

$$d^*_A(x) = d(A, x),$$

is right 1-Lipschitz.

Proof. Let $x, y \in X$. Then

$$d(x, y) + d_A(y) = d(x, y) + \inf_{w \in A} d(y, w) = \inf_{w \in A} \{d(x, y) + d(y, w)\} \ge \inf_{w \in A} d(x, w) = d_A(x),$$

where the inequality follows by the triangle inequality. □

Lemma 2.4.6. Let $(X, d)$ be a quasi-metric space, $\{f_i\}_{i=1}^{n}$ a finite collection
of left (right) 1-Lipschitz functions $X \to \mathbb{R}$ and $\{\lambda_i\}_{i=1}^{n}$ a collection of
coefficients such that $\lambda_i \ge 0$ for all $i = 1, 2, \ldots, n$ and
$\sum_{i=1}^{n} \lambda_i = 1$. Then

$$f = \sum_{i=1}^{n} \lambda_i f_i$$

is left (right) 1-Lipschitz.

Proof. We prove the left case only.

$$f(x) - f(y) = \sum_{i=1}^{n} \lambda_i f_i(x) - \sum_{i=1}^{n} \lambda_i f_i(y) = \sum_{i=1}^{n} \lambda_i (f_i(x) - f_i(y)) \le \sum_{i=1}^{n} \lambda_i d(x, y) = d(x, y).$$

□

In particular, for any collection $\{f_i\}_{i=1}^{n}$ of left 1-Lipschitz functions, the
normalised sum $f = \frac{1}{n} \sum_{i=1}^{n} f_i$ is left 1-Lipschitz.
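
A convex combination of distance functions gives a quick numerical check of
this lemma. The sketch below is an illustration under our own assumptions: the
quasi-metric `d`, the centres and the weights are hypothetical, and exact
rational arithmetic is used to avoid rounding effects.

```python
from fractions import Fraction

def d(x, y):
    """Hypothetical quasi-metric on numbers: going up costs double."""
    return 2 * (y - x) if y >= x else (x - y)

centres = [-2, 0, 3]
weights = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # non-negative, sum to 1

# Each x -> d(x, c) is left 1-Lipschitz (distance to a fixed point).
fs = [lambda x, c=c: d(x, c) for c in centres]

def f(x):
    """Convex combination of the distance functions above."""
    return sum(w * g(x) for w, g in zip(weights, fs))

points = range(-4, 5)
left_ok = all(max(f(x) - f(y), 0) <= d(x, y) for x in points for y in points)
```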

2.4.2 Quasi-normed spaces of left-Lipschitz functions and best 
approximation 

Another example of a semilinear quasi-normed space was produced by Romaguera
and Sanchis [161], who constructed a quasi-normed semilinear space of left
Lipschitz functions.

Denote by $SL_0(d)$ the set of all left Lipschitz functions on a quasi-metric space
$(X, d)$ that vanish at some fixed point $x_0$. We can define for all
$f, g \in SL_0(d)$ and $a \in \mathbb{R}_+$ the sum $f + g$ and scalar multiple $a \cdot f$ in the
usual way, producing a semilinear space $(SL_0(d), +, \cdot)$ on $\mathbb{R}_+$.

Also, the function $\|\cdot\|_d : SL_0(d) \to \mathbb{R}_+$ defined by

$$\|f\|_d = \sup_{d(x, y) \neq 0} \frac{(f(x) - f(y)) \vee 0}{d(x, y)} < \infty$$

is a quasi-norm on $SL_0(d)$ and hence $(SL_0(d), \|\cdot\|_d)$ forms a quasi-normed
semilinear space.

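Theorem 2.4.7 ([161]). The function
$\rho_d : SL_0(d) \times SL_0(d) \to \mathbb{R}_+ \cup \{\infty\}$, where

$$\rho_d(f, g) = \sup_{d(x, y) \neq 0} \frac{((f - g)(x) - (f - g)(y)) \vee 0}{d(x, y)},$$

is a bicomplete extended quasi-metric on $SL_0(d)$. □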

Recall that a set $S$ in a linear space $E$ is convex if and only if for any
collection $x_1, x_2, \ldots, x_n \in S$ and $\lambda_1, \lambda_2, \ldots, \lambda_n \in \mathbb{R}_+$ such that
$\sum_{i=1}^{n} \lambda_i = 1$, we have $\sum_{i=1}^{n} \lambda_i x_i \in S$. This definition can be
extended to semilinear spaces and hence, by Lemma 2.4.6, the set of 1-Lipschitz
functions vanishing at a fixed point is a convex subset of $SL_0(d)$.

Best approximation 

From now on to the end of this section let $(X, d)$ be, as before, a quasi-metric
space and denote by $\mathrm{cl}_X\{y\}$ the closure $\{x : d(x, y) = 0\}$ of the subset $\{y\}$
in the topology $\mathcal{T}(d)$. Let $Y \subseteq X$, $p \in X$ and denote by $P_Y(p)$ the set of
points of best approximation to $p$ by elements of $Y$, that is:

$$P_Y(p) = \{y_0 \in Y : d(p, Y) = d(p, y_0)\}.$$

Theorem 2.4.8 ([161]). Let $p \notin \bigcup \{\mathrm{cl}_X\{y\} \mid y \in Y\}$ and let $M \subseteq Y$. Then
$M \subseteq P_Y(p)$ if and only if there exists $f \in SL_0(d)$ such that

1. $\|f\|_d = 1$,

2. $f|_Y = 0$, and

3. $d(p, y) = f(p) - f(y)$ for all $y \in M$. □

Furthermore, define $Y_0 = \{f \in SL_0(d) : f|_Y = 0\}$, and for each $x, y \in X$
such that $d(x, y) \neq 0$ set

$$d_{Y_0}(x, y) = \sup \Big\{ \frac{(f(x) - f(y)) \vee 0}{\|f\|_d} : f \in Y_0, \ \|f\|_d \neq 0 \Big\}.$$

Theorem 2.4.9 ([161]). Let $p \notin Y$ and let $M \subseteq Y$. Then $M \subseteq P_Y(p)$ if and only
if $d_{Y_0}(p, y) = d(p, y)$ for all $y \in M$. □

2.5 Hausdorff quasi-metric 

Asymmetric variants of the Hausdorff metric provide further examples of
quasi-metrics.

Definition 2.5.1. Let $(X, \rho)$ be a metric space. A map
$\rho_H : \mathcal{K}_0(X, \rho) \times \mathcal{K}_0(X, \rho) \to \mathbb{R}_+$ defined by

$$\rho_H(A, B) = \max\Big\{ \sup_{a \in A} \rho(a, B), \ \sup_{b \in B} \rho(b, A) \Big\}$$

is called the Hausdorff metric. ▲

Figure 2.4: Hausdorff distance between two sets.

Remark 2.5.2. An equivalent, more geometric way would be to define

$$\rho_H(A, B) = \inf\{\varepsilon > 0 : A \subseteq B_\varepsilon, \ B \subseteq A_\varepsilon\}.$$

In other words, $\rho_H(A, B)$ is the infimal $\varepsilon \ge 0$ such that for every $\delta > 0$, $A$ is
contained in the $(\varepsilon + \delta)$-neighbourhood of $B$ and $B$ is contained in the
$(\varepsilon + \delta)$-neighbourhood of $A$ (Fig. 2.5).

At this stage we omit the proof that the Hausdorff metric is indeed a metric on
$\mathcal{K}_0(X, \rho)$ since it follows from the properties of the Hausdorff quasi-metric
defined below.

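Definition 2.5.3. Let $(X, d)$ be a pseudo-quasi-metric space. Denote by $d_H^+$,
$d_H^-$ and $d_H$ the maps $\mathcal{P}_0(X) \times \mathcal{P}_0(X) \to \mathbb{R}_+ \cup \{\infty\}$ where for all
$A, B \in \mathcal{P}_0(X)$,

$$d_H^+(A, B) = \sup_{a \in A} d(a, B),$$

$$d_H^-(A, B) = \sup_{b \in B} d(A, b),$$

and

$$d_H(A, B) = \max\{d_H^+(A, B), \ d_H^-(A, B)\}.$$

▲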
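
For finite sets the suprema and infima in this definition become maxima and
minima, so the three maps can be computed directly. The following sketch is an
illustration only: the function names and the three-point quasi-metric given by
an explicit table are assumptions made for the example.

```python
def d_H_plus(d, A, B):
    """sup over a in A of d(a, B), for finite non-empty A and B."""
    return max(min(d(a, b) for b in B) for a in A)

def d_H_minus(d, A, B):
    """sup over b in B of d(A, b), for finite non-empty A and B."""
    return max(min(d(a, b) for a in A) for b in B)

def d_H(d, A, B):
    """Maximum of the two one-sided distances."""
    return max(d_H_plus(d, A, B), d_H_minus(d, A, B))

# Hypothetical three-point quasi-metric, listed pair by pair.
table = {
    ("a", "a"): 0, ("b", "b"): 0, ("c", "c"): 0,
    ("a", "b"): 0, ("c", "b"): 0,
    ("a", "c"): 1, ("b", "a"): 1, ("b", "c"): 1, ("c", "a"): 1,
}
q = lambda x, y: table[(x, y)]

A = {"a", "b"}
B = {"b", "c"}
```

In this example both `d_H_plus(q, A, B)` and `d_H_plus(q, B, A)` vanish although
the two sets differ, whereas `d_H(q, A, B)` is 1; the one-sided distance alone
need not separate sets.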
Lemma 2.5.4. Let $(X, d)$ be a pseudo-quasi-metric space. Then $d_H^+$, $d_H^-$ and
$d_H$ are extended pseudo-quasi-metrics.

Proof. It is obvious that for any $A \in \mathcal{P}_0(X)$,
$d_H^+(A, A) = d_H^-(A, A) = d_H(A, A) = 0$ as $d$ is a pseudo-quasi-metric. To prove
the triangle inequality let $A, B, C \in \mathcal{P}_0(X)$. Take any $a \in A$, $b \in B$. By
Lemma 2.1.6 we have

$$d(a, C) \le d(a, b) + d(b, C) \le d(a, b) + d_H^+(B, C),$$

where the second inequality holds by the definition of $d_H^+$. Hence,
$d(a, C) \le d(a, B) + d_H^+(B, C)$ and by taking the supremum over $a \in A$ on both
sides we get $d_H^+(A, C) \le d_H^+(A, B) + d_H^+(B, C)$ as required.

The statement for $d_H^-$ follows by the same argument once we note that
$d_H^-(A, B) = \sup_{b \in B} d(A, b) = \sup_{b \in B} d^*(b, A)$. It is obvious that if both
$d_H^+$ and $d_H^-$ satisfy the triangle inequality then $d_H$ does as well. □

Lemma 2.5.5. Let $(X, d)$ be a quasi-metric space with $\rho = d^s$, the associated
metric. Then for any $A, B \in \mathcal{P}_0(X)$

$$\rho_H^+(A, B) = \max\{d_H^+(A, B), \ d_H^-(B, A)\} \quad \text{and}$$

$$\rho_H^-(A, B) = \max\{d_H^-(A, B), \ d_H^+(B, A)\}.$$

Proof. The result follows straight from the definition:

$$\max\{d_H^+(A, B), d_H^-(B, A)\} = \sup_{a \in A} \max\{d(a, B), d(B, a)\} = \sup_{a \in A} \rho(a, B) = \rho_H^+(A, B).$$

Similarly, $\max\{d_H^-(A, B), d_H^+(B, A)\} = \sup_{b \in B} \rho(A, b) = \rho_H^-(A, B)$. □

Lemma 2.5.6. Let $(X, d)$ be a quasi-metric space. Then $d_H$ restricted to
$\mathcal{C}_0(X, d)$ is an extended quasi-metric and restricted to $\mathcal{K}_0(X, d)$ is a
quasi-metric.

Proof. To show that $d_H$ is an extended quasi-metric, only the separation axiom
needs to be proven as the rest follows by Lemma 2.5.4.

Suppose $A, B \in \mathcal{C}_0(X, d)$ and $d_H(A, B) = d_H(B, A) = 0$. Let $\rho = d^s$. By
Lemma 2.5.5 we have $\rho_H^+(A, B) = \rho_H^-(A, B) = 0$. Now, if $\rho_H^+(A, B) = 0$,
then for all $a \in A$ there exists a $b \in B$ such that $\rho(a, b) = 0$ as $B$ is closed,
implying $a = b$ since $\rho$ is a metric. Hence, $\rho_H^+(A, B) = 0 \implies A \subseteq B$.
Similarly, $\rho_H^-(A, B) = 0 \implies B \subseteq A$ as $\rho_H^-(A, B) = \rho_H^+(B, A)$. Therefore,
$d_H(A, B) = d_H(B, A) = 0$ implies $A = B$.

If $A, B \in \mathcal{K}_0(X, d)$, the function $a \mapsto d(a, B)$ on $A$ is left 1-Lipschitz
(Lemma 2.4.5), hence continuous (Lemma 2.4.2) and bounded since $A$ is compact.
Hence $d_H(A, B) < \infty$ and thus $d_H$ is a quasi-metric. □

We are therefore justified to state the following 

Definition 2.5.7. Let $(X, d)$ be a quasi-metric space. The map $d_H$ restricted to
$\mathcal{C}_0(X, d)$ is called a Hausdorff extended quasi-metric and restricted to $\mathcal{K}_0(X, d)$
is called a Hausdorff quasi-metric. ▲

Corollary 2.5.8. Let $(X, d)$ be a quasi-metric space. The Hausdorff metric over
$\mathcal{K}_0(X, d^s)$ restricted to $\mathcal{K}_0(X, d)$ is the metric associated to the Hausdorff
quasi-metric over $\mathcal{K}_0(X, d)$.

Proof. Follows from Lemmas 2.5.5 and 2.5.6. □

A stronger statement for $d_H^+$ and $d_H^-$ is possible if the underlying space is $T_1$-separated.

Lemma 2.5.9. Let $(X, d)$ be a $T_1$ quasi-metric space. Then $d_H^+$ and $d_H^-$, restricted
to $\mathcal{C}_0(X, d)$, are extended quasi-metrics whose associated orders correspond to
set inclusion. They are quasi-metrics if they are restricted to $\mathcal{K}_0(X, d)$.

Proof. As in Lemma 2.5.6, we only need to prove separation; the rest follows by
Lemma 2.5.4. Take any $A, B \in \mathcal{C}_0(X, d)$ and suppose $d_H^+(A, B) = 0$. Then,
for all $a \in A$ and for all $\varepsilon > 0$, there is a $b \in B$ such that $d(a, b) < \varepsilon$. Since $B$
is closed, there exists a $b_0 \in B$ such that $d(a, b_0) = 0$ and therefore $a = b_0$ as
$d$ satisfies the $T_1$ separation axiom. Thus $A \subseteq B \iff d_H^+(A, B) = 0$ and it
immediately follows that the associated order is set inclusion and that
$d_H^+(A, B) = d_H^+(B, A) = 0 \implies A = B$.

If $A, B \in \mathcal{K}_0(X, d)$, the function $a \mapsto d(a, B)$ is left 1-Lipschitz
(Lemma 2.4.5), hence continuous (Lemma 2.4.2) and bounded since $B$
is compact. Hence $d_H^+(A, B) < \infty$.

The statements for $d_H^-$ follow by duality. □

Remark 2.5.10. The assumption that $d$ satisfies the $T_1$ separation axiom is indeed
necessary for separation. Consider the following example of a general quasi-metric
space where $q_H^+(A, B) = q_H^+(B, A) = 0$ no longer implies $A = B$.

Let $X = \{a, b, c\}$ and define a quasi-metric $q$ by $q(a, a) = q(b, b) = q(c, c) =
q(a, b) = q(c, b) = 0$ and $q(a, c) = q(b, a) = q(b, c) = q(c, a) = 1$. Let $A =
\{a, b\}$ and $B = \{b, c\}$. It can be easily verified (Figure 2.5) that $q$ is indeed a
quasi-metric on $X$ and that $q_H^+(A, B) = q_H^+(B, A) = 0$ but $A \neq B$.
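
The counterexample of Remark 2.5.10 can be checked mechanically. The following Python sketch is illustrative only: the helper names (`is_quasi_metric`, `hausdorff_plus`, `hausdorff_minus`) are our own and do not come from the thesis, and suprema and infima reduce to `max` and `min` because all sets involved are finite.

```python
from itertools import product

# The three-point space of Remark 2.5.10; pairs not listed have distance 1.
POINTS = ("a", "b", "c")
ZERO_PAIRS = {("a", "a"), ("b", "b"), ("c", "c"), ("a", "b"), ("c", "b")}


def q(x, y):
    """The quasi-metric of the remark: 0 on the listed pairs, 1 otherwise."""
    return 0 if (x, y) in ZERO_PAIRS else 1


def is_quasi_metric(points, dist):
    """Check d(x, x) = 0, the triangle inequality and T0 separation."""
    if any(dist(x, x) != 0 for x in points):
        return False
    for x, y, z in product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z):
            return False
    for x, y in product(points, repeat=2):
        if x != y and dist(x, y) == 0 and dist(y, x) == 0:
            return False
    return True


def hausdorff_plus(dist, A, B):
    """d_H^+(A, B): sup over a in A of d(a, B), where d(a, B) = inf over b in B."""
    return max(min(dist(a, b) for b in B) for a in A)


def hausdorff_minus(dist, A, B):
    """d_H^-(A, B): sup over b in B of d(A, b), where d(A, b) = inf over a in A."""
    return max(min(dist(a, b) for a in A) for b in B)


A, B = {"a", "b"}, {"b", "c"}
print(is_quasi_metric(POINTS, q))                        # True
print(hausdorff_plus(q, A, B), hausdorff_plus(q, B, A))  # 0 0
print(hausdorff_minus(q, A, B))                          # 1
```

Both one-sided distances $q_H^+(A, B)$ and $q_H^+(B, A)$ vanish although $A \neq B$, which is possible because $q(a, b) = 0$ with $a \neq b$, that is, $q$ is not $T_1$.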




Figure 2.5: Illustration of Remark 2.5.10.

The construction above was observed by Berthiaume [18] in a more general
context of quasi-uniformities over hyperspaces of quasi-uniform spaces. There
exist alternative definitions of the Hausdorff quasi-metric. Vitolo [200] defines an
(extended) Hausdorff quasi-metric over the collection of all nonempty closed
subsets of a metric space $(X, d)$ by

\[ e_d(A, B) = \sup d(a, B), \]

that is, in our notation, his quasi-metric corresponds to $d_H^-$. We now briefly survey
his application of this quasi-metric to quasi-metrisability of topological spaces.

Theorem 2.5.11 (Vitolo [200]). Every (extended) quasi-metric space embeds into
the quasi-metric space of the form $(\mathcal{C}_0(Y, \rho), \rho_H^-)$, where $(Y, \rho)$ is a metric space. □

Let $(X, d)$ be a quasi-metric space. The proof involves construction of the
space $Y = X \times \mathbb{R}_+$ with the metric $\rho$ where

\[ \rho((s, \alpha), (t, \beta)) = d^s(s, t) + |\alpha - \beta| \]

for all $(s, \alpha), (t, \beta) \in Y$. The mapping $E : X \to \mathcal{C}_0(Y, \rho)$ where

\[ E(z) = \{(y, \eta) \in Y : d(y, z) \leq \eta\} \]

produces the required embedding.

Corollary 2.5.12 (Vitolo [200]). A topological space is quasi-metrisable if and
only if it admits a topological embedding into a hyperspace. □

2.6 Weighted quasi-metrics and partial metrics 

Our main example of a quasi-metric comes from biological sequence analysis. 
It turns out that the similarity scores between biological sequences can often be 
mapped to a more restricted class of quasi-metrics, the weighted quasi-metrics 
[119, 201], or equivalently, the partial metrics [133]. Chapter 3 presents the full
development of the biological application while the present section surveys the 
mathematical theory that was originally developed in the context of theoretical 
computer science. 

2.6.1 Weighted quasi-metrics 

Definition 2.6.1 ([119, 201]). Let $(X, d)$ be a quasi-metric space. The quasi-metric
$d$ is called a weightable quasi-metric if there exists a function $w : X \to \mathbb{R}_+$,
called the weight function or simply the weight, satisfying for every $x, y \in X$

\[ d(x, y) + w(x) = d(y, x) + w(y). \]

In this case we call $d$ weightable by $w$.

A quasi-metric $d$ is co-weightable if its conjugate quasi-metric $d^*$ is weightable.
The weight function $w$ by which $d^*$ is weightable is called the co-weight of $d$ and
$d$ is co-weightable by $w$.

A triple $(X, d, w)$ where $(X, d)$ is a quasi-metric space and $w$ a function $X \to \mathbb{R}_+$
is called a weighted quasi-metric space if $(X, d)$ is weightable by $w$ and a
co-weighted quasi-metric space if $(X, d)$ is co-weightable by $w$.

In all the above, if the weight function $w$ takes values in $\mathbb{R}$ instead of $\mathbb{R}_+$, the
prefix generalised is added to the definitions. ▲

Not every quasi-metric space is weightable [133] but each metric space is obviously
weightable, admitting constant weight functions. If $(X, d, w)$ is a weighted
quasi-metric space then so is $(X, d, w + C)$ where $C > 0$.

Definition 2.6.2 ([170]). Let $X$ be a set. A function $f : X \to \mathbb{R}_+$ is fading if
$\inf_{x \in X} f(x) = 0$. A weighted quasi-metric space $(X, d, w)$ is of fading weight if
its weight function is fading. ▲

Lemma 2.6.3 ([119], [170]). The weight functions of a weightable quasi-metric
space are strictly decreasing (with respect to the associated partial order). These
are exactly the functions of the form $f + C$, where $C \geq 0$ and where $f$ is the
unique fading weight of the space.

Example 2.6.4. The set-difference quasi-metric on finite sets (Example 2.2.14) is
co-weightable with a co-weight assigning to each set $A$ its cardinality $|A|$.

Example 2.6.5 ([119]). Let $X = \mathbb{R}_+$ and set $d = u|_{\mathbb{R}_+}$, the restriction of $u$ to
positive reals (i.e. for any $x, y \in \mathbb{R}_+$, $d(x, y) = y - x$ if $x \leq y$ and $d(x, y) = 0$
if $y < x$). Set $w(x) = x$ for all $x \in X$. It is easy to verify that $(X, d, w)$ is a
weighted quasi-metric space and that $w$ is its unique fading weight function.

Example 2.6.5 shows that a weightable quasi-metric space need not be co-weightable;
in that case its weight is unbounded. Further examples are provided
in [119]. It is easy to see that a generalised weightable quasi-metric space is
exactly a space which is weightable or co-weightable. The following result can
be used to distinguish between weighted and co-weighted quasi-metric spaces.

Lemma 2.6.6 ([119], [201]). Let $(X, d, w)$ be a generalised weighted quasi-metric
space.

• If $w(x) \geq m$ for all $x \in X$, $(X, d, w - m)$ is a weighted quasi-metric space;

• If $w(x) \leq M$ for all $x \in X$, $(X, d^*, M - w)$ is a weighted quasi-metric space;

• If $(X, d^*, u)$ is a generalised weighted quasi-metric space then $w + u$ is
constant on $X$. □

Lemma 2.6.7. Let $(X, d, w)$ be a weighted quasi-metric space. Then $w$ is a right
1-Lipschitz function.

Proof. Let $x, y \in X$. Then $w(x) - w(y) = d(y, x) - d(x, y) \leq d(y, x)$. □

Hence it follows that a weight function $w$ for a weightable quasi-metric space
$(X, d, w)$ is a continuous function $X \to \mathbb{R}_+$ with regard to the quasi-metric (i.e.
it is upper semicontinuous).

A partial topological characterisation of weighted quasi-metric spaces was obtained
by Künzi and Vajner [119]. For example, they show that the Sorgenfrey line is
not weightable. The full results of their investigation are outside the scope of this thesis
and we only present a theorem about weightability of Alexandroff topologies.

Theorem 2.6.8 ( II119II ). Let < be a partial order on a set X and T be the full 
Alexandroff topology on X. 

Then {X, T) admits a weightable quasi-metric if and only if there is a function 
w : X ^ ]R_i_ such that for each x E X there exists > such that for any 
y,z E X with x < y, z < y and x ^ zwe have w{z) — w{y) > l^. □ 
2.6.2 Bundles over metric spaces

Vitolo [201] characterised weighted quasi-metric spaces as bundles over a metric
space. 

Definition 2.6.9. Let $(X, \rho)$ be a metric space. A bundle over $(X, \rho)$ [201] is the
weighted quasi-metric space $(X \times \mathbb{R}_+, d, w)$ where

d{{x, 0, (y, V)) = p{x, y)+^-r] 

and 

A 

Theorem 2.6.10 ([201]). Every weighted quasi-metric space embeds into the bundle
over a metric space. □

In fact, every weighted quasi-metric space can be constructed from a metric
space and a non-distance-increasing (1-Lipschitz) positive real-valued function on
it. If a generalised weighted quasi-metric space is desired, such a function can take
values over the whole real line.

Theorem 2.6.11 ([201]). Given a metric space $(Y, \rho)$ and a 1-Lipschitz function
$f : Y \to \mathbb{R}_+$, let $G = \{(s, f(s)) : s \in Y\}$ be the graph of $f$. If $d : G \times G \to \mathbb{R}$ is
defined by

\[ d((s, f(s)), (t, f(t))) = \rho(s, t) + f(t) - f(s) \]

then $(G, d, 2f)$ is a weighted quasi-metric space. Moreover, every weighted
quasi-metric space can be constructed in this way.

The quasi-metric space $(G, d)$ is $T_1$-separated if and only if the function $f$
above also satisfies

\[ \forall s, t \in Y : s \neq t, \quad |f(s) - f(t)| < \rho(s, t). \]

□
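
To make the construction of Theorem 2.6.11 concrete, the following Python sketch builds $d$ from a metric and a 1-Lipschitz function and checks the weight equation with weight $2f$. The finite set `Y`, the function `f` and all helper names are hypothetical choices made for illustration; they are not taken from [201].

```python
def graph_quasi_metric(rho, f):
    """Return d with d(s, t) = rho(s, t) + f(t) - f(s).

    A point (s, f(s)) of the graph G is represented by s alone.
    """
    def d(s, t):
        return rho(s, t) + f(t) - f(s)
    return d


def is_weighted(points, d, w, tol=1e-12):
    """Check d(x, y) + w(x) == d(y, x) + w(y) for all pairs."""
    return all(
        abs(d(x, y) + w(x) - d(y, x) - w(y)) <= tol
        for x in points for y in points
    )


# Hypothetical data: a finite subset of the real line with the usual metric
# and the non-negative 1-Lipschitz function f(s) = s / 2.
Y = [0.0, 1.0, 2.5, 4.0]
rho = lambda s, t: abs(s - t)
f = lambda s: s / 2
d = graph_quasi_metric(rho, f)
w = lambda s: 2 * f(s)   # the weight 2f of the theorem

print(is_weighted(Y, d, w))                      # weight equation holds
print(all(d(s, t) >= 0 for s in Y for t in Y))   # d is non-negative
```

Since $|f(s) - f(t)| = |s - t|/2 < |s - t|$ for $s \neq t$, this particular choice of $f$ also satisfies the $T_1$ condition, so $d(s, t) > 0$ whenever $s \neq t$.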
Theorem 2.6.12 ([201]). A quasi-metric space $(X, d)$ admits a generalised weight
if and only if

\[ \forall x, y, z \in X \quad d(x, y) + d(y, z) + d(z, x) = d(x, z) + d(z, y) + d(y, x). \]

Furthermore, $(X, d)$ is weightable if and only if it admits a generalised weight
and for some (equivalently for each) $a \in X$, the set

\[ \Gamma_a = \{d(a, x) - d(x, a) \mid x \in X\} \]

is bounded below.

The generalised weight function above is given by $\gamma_a(x) = d(a, x) - d(x, a)$,
$a \in X$. The statement can be dualised to the co-weightable case and used to
distinguish weightable and co-weightable quasi-metric spaces.
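
On a finite space the criterion of Theorem 2.6.12 can be tested directly. The sketch below is illustrative only: the function names and the two sample spaces are our own assumptions. It checks the cyclic condition, builds the generalised weight $\gamma_a$ from a base point $a$, and shifts it by its minimum (which exists because the set is finite) to obtain a non-negative weight.

```python
from itertools import product


def admits_generalised_weight(points, d):
    """Cyclic condition of the theorem, checked over all triples."""
    return all(
        d(x, y) + d(y, z) + d(z, x) == d(x, z) + d(z, y) + d(y, x)
        for x, y, z in product(points, repeat=3)
    )


def weight_from_base_point(points, d, a):
    """gamma_a(x) = d(a, x) - d(x, a), shifted so that its minimum is 0."""
    gamma = {x: d(a, x) - d(x, a) for x in points}
    shift = min(gamma.values())
    return {x: g - shift for x, g in gamma.items()}


def is_weight(points, d, w):
    """Check the weight equation d(x, y) + w(x) == d(y, x) + w(y)."""
    return all(d(x, y) + w[x] == d(y, x) + w[y]
               for x, y in product(points, repeat=2))


def u(x, y):
    """The quasi-metric of Example 2.6.5 on non-negative numbers."""
    return max(y - x, 0)


# Hypothetical non-weightable example: going round the cycle a -> b -> c -> a
# costs 1 per step, going the other way costs 2 per step.
CYCLE = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 1,
         ("b", "a"): 2, ("c", "b"): 2, ("a", "c"): 2}


def c(x, y):
    return 0 if x == y else CYCLE[x, y]


points = [0, 1, 2, 5]
w = weight_from_base_point(points, u, a=2)
print(admits_generalised_weight(points, u), w)          # True {0: 0, 1: 1, 2: 2, 5: 5}
print(admits_generalised_weight(["a", "b", "c"], c))    # False
```

On the sample points the recovered weight agrees with $w(x) = x$ from Example 2.6.5, independently of the base point chosen, because the shift removes the additive constant.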

2.6.3 Partial metrics 

Matthews [133] proposed the concept of a partial metric, a generalisation of metrics
which allows distances of points from themselves to be non-zero. He then
showed that partial metrics correspond to weighted quasi-metrics. Partial metrics
were further developed with a view to the applications in theoretical computer
science [147, 30, 31, 163, 170]. The greatest relevance of partial metrics in the
context of this thesis is that similarity scores between biological sequences very
often correspond exactly to partial metrics.

Definition 2.6.13 (Matthews [133]). Let $X$ be a set. A map $p : X \times X \to \mathbb{R}$ is
called a partial metric if for any $x, y, z \in X$:

1. $p(x, y) \geq p(x, x)$;

2. $x = y \iff p(x, x) = p(y, y) = p(x, y)$;

3. $p(x, y) = p(y, x)$;

4. $p(x, z) \leq p(x, y) + p(y, z) - p(y, y)$.

For a partial metric $p$ its associated partial order $\leq_p$ is defined so that for all
$x, y \in X$,

\[ x \leq_p y \iff p(x, y) = p(x, x). \]

▲

A partial metric $p$ induces a topology $T(p)$ whose base are the open balls of
radius $\varepsilon > 0$ of the form $\{y \in X : p(x, y) < p(x, x) + \varepsilon\}$ ([147]).

Example 2.6.14 ( lfT33]l ). Let X be any set and Y = X^, the set of all infinite 
sequences of elements of X. The Baire metric is a distance dowY defined for all 
x,y eY by: 



Denote by X* the set of all finite and infinite sequences over X and for each 
finite sequence y E X* denote by |y| its length (we agree that for all y E X^, 
\y\ = oo). The map p : X* x X* ^ R, where for all x,y E X* x X* 



is called the Baire partial metric. It follows that $p(x, x) = 2^{-|x|}$.

Theorem 2.6.15 ([133]). Let $X$ be a set.

1. For any partial metric $p$ on $X$, the map $q : X \times X \to \mathbb{R}$ where for all
$x, y \in X$

\[ q(x, y) = p(x, y) - p(x, x) \]

is a generalised weighted quasi-metric with weight function $w : x \mapsto p(x, x)$
such that $T(p) = T(q)$ and $\leq_p = \leq_q$.

2. For any (generalised) weighted quasi-metric $q$ over $X$ with weight function
$w$, the map $p : X \times X \to \mathbb{R}$ where for all $x, y \in X$

\[ p(x, y) = q(x, y) + w(x) \]

is a partial metric such that $T(q) = T(p)$ and $\leq_q = \leq_p$.

□
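
The correspondence of Theorem 2.6.15 is easy to experiment with on finite data. The sketch below is a minimal illustration under our own assumptions: it uses the interval partial metric $p([a, b], [c, d]) = \max\{b, d\} - \min\{a, c\}$ (which appears later in this section as Example 2.6.25) on a handful of hypothetical integer intervals, and all function names are ours.

```python
from itertools import product


def p_interval(I, J):
    """Partial metric on closed intervals: max of right ends minus min of left ends."""
    (a, b), (c, d) = I, J
    return max(b, d) - min(a, c)


def is_partial_metric(points, p):
    """Check the four axioms of a partial metric on a finite set."""
    for x, y, z in product(points, repeat=3):
        if p(x, y) < p(x, x):                                 # axiom 1
            return False
        if (x == y) != (p(x, x) == p(y, y) == p(x, y)):       # axiom 2
            return False
        if p(x, y) != p(y, x):                                # axiom 3
            return False
        if p(x, z) > p(x, y) + p(y, z) - p(y, y):             # axiom 4
            return False
    return True


def quasi_metric_from_partial(p):
    """q(x, y) = p(x, y) - p(x, x), as in part 1 of the theorem."""
    return lambda x, y: p(x, y) - p(x, x)


# Hypothetical sample of intervals, written as (left end, right end).
intervals = [(0, 1), (0, 3), (1, 2), (2, 5), (1, 1)]
q = quasi_metric_from_partial(p_interval)
w = lambda x: p_interval(x, x)          # weight: the length of the interval

print(is_partial_metric(intervals, p_interval))   # True
print(q((0, 3), (1, 2)), q((1, 2), (0, 3)))       # 0 2
```

Here $q([0, 3], [1, 2]) = 0$ because $[1, 2] \subseteq [0, 3]$, in line with the associated order being reverse inclusion, and $q(x, y) + w(x) = p(x, y)$ recovers the partial metric as in part 2 of the theorem.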
2.6.4 Semilattices, semivaluations and semigroups

In this subsection we review the results of Schellekens [170] and Romaguera and
Schellekens [165] about the weightable quasi-metrics on semilattices and semigroups.
These are, in the context of lattices, also mentioned in [147, 30, 31].
Again, the motivation comes from biological sequences, which are also instances
of semigroups.

Definition 2.6.16. Let $(X, \leq)$ be a partial order. Then $(X, \leq)$ is called a join
semilattice if for every $x, y \in X$ there exists a supremum, denoted $x \sqcup y$, and a
meet semilattice if for every $x, y \in X$ there exists an infimum, denoted $x \sqcap y$. A
lattice is a partial order which is both a join and a meet semilattice. ▲

Definition 2.6.17. If $(X, \leq)$ is a join semilattice then a function $f : (X, \leq) \to \mathbb{R}_+$
is a join valuation iff for all $x, y, z \in X$

\[ f(x \sqcup z) \leq f(x \sqcup y) + f(y \sqcup z) - f(y) \]

and $f$ is a join co-valuation iff for all $x, y, z \in X$

\[ f(x \sqcup z) \geq f(x \sqcup y) + f(y \sqcup z) - f(y). \]

If $(X, \preceq)$ is a meet semilattice then a function $f : (X, \preceq) \to \mathbb{R}_+$ is a meet
valuation iff for all $x, y, z \in X$

\[ f(x \sqcap z) \geq f(x \sqcap y) + f(y \sqcap z) - f(y) \]

and $f$ is a meet co-valuation iff for all $x, y, z \in X$

\[ f(x \sqcap z) \leq f(x \sqcap y) + f(y \sqcap z) - f(y). \]

A function is a semivaluation if it is either a join valuation or a meet valuation.
A semivaluation space is a semilattice equipped with a semivaluation. ▲

Definition 2.6.18. A quasi-metric space $(X, d)$ is called a join (meet) semilattice
quasi-metric space if its associated partial order is a join (meet) semilattice. ▲

Equivalently, a quasi-metric space $(X, d)$ is a join semilattice if for all $x, y \in X$
there exists a $z \in X$ such that $d(x, z) = 0$ and $d(y, z) = 0$ and a meet semilattice
if for all $x, y \in X$ there exists a $z \in X$ such that $d(z, x) = 0$ and $d(z, y) = 0$.

Definition 2.6.19. A join semilattice quasi-metric space $(X, d)$ is called invariant
if for all $x, y, z \in X$, $d(x \sqcup z, y \sqcup z) \leq d(x, y)$. Similarly, a meet semilattice
quasi-metric space $(X, d)$ is invariant if for all $x, y, z \in X$, $d(x \sqcap z, y \sqcap z) \leq d(x, y)$. ▲

We are now able to state the main theorem of [170], associating invariant
weighted quasi-metrics and monotone semivaluations on meet semilattices. There
is also a dual of this theorem for join semilattices that is not presented here.

Theorem 2.6.20 ([170]). For every meet semilattice $(X, \preceq)$ there exists a bijection
between invariant co-weightable quasi-metrics $d$ on $X$ with $\leq_d = \preceq$ and fading
strictly increasing meet valuations $f : (X, \preceq) \to (\mathbb{R}_+, \leq)$. The map $f \mapsto d_f$ is
defined by $d_f(x, y) = f(x) - f(x \sqcap y)$. The inverse is the function which to each
co-weightable space $(X, d)$ assigns its unique fading co-weight.

Similarly, one can show that for every meet semilattice $(X, \preceq)$ there exists
a bijection between invariant weightable quasi-metrics $d$ on $X$ with $\leq_d = \preceq$ and
fading strictly decreasing meet valuations $f : (X, \preceq) \to (\mathbb{R}_+, \leq)$. The map
$f \mapsto d_f$ is defined by $d_f(x, y) = f(x \sqcap y) - f(x)$. The inverse is the function
which to each weightable space $(X, d)$ assigns its unique fading weight. □
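
As a small sanity check of the first part of Theorem 2.6.20, consider the subsets of a finite set ordered by inclusion, with meet given by intersection and $f$ the cardinality. The Python sketch below is our own illustration (the helper names and the three-element universe are assumptions); it verifies that $f$ is a meet valuation, that $d_f(A, B) = |A \setminus B|$, and that $d_f$ is invariant and co-weightable by $f$.

```python
from itertools import combinations, product


def powerset(universe):
    """All subsets of a finite set, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def d_f(f, x, y):
    """d_f(x, y) = f(x) - f(x meet y), with meet = set intersection."""
    return f(x) - f(x & y)


def is_meet_valuation(elements, f):
    """f(x meet z) >= f(x meet y) + f(y meet z) - f(y) for all triples."""
    return all(f(x & z) >= f(x & y) + f(y & z) - f(y)
               for x, y, z in product(elements, repeat=3))


def is_invariant(elements, f):
    """d_f(x meet z, y meet z) <= d_f(x, y) for all triples."""
    return all(d_f(f, x & z, y & z) <= d_f(f, x, y)
               for x, y, z in product(elements, repeat=3))


def is_co_weight(elements, f):
    """f is a weight for the conjugate: d(y, x) + f(x) == d(x, y) + f(y)."""
    return all(d_f(f, y, x) + f(x) == d_f(f, x, y) + f(y)
               for x, y in product(elements, repeat=2))


S = powerset({1, 2, 3})
f = len   # cardinality: strictly increasing and fading, since f(empty set) = 0

print(is_meet_valuation(S, f), is_invariant(S, f), is_co_weight(S, f))
print(all(d_f(f, x, y) == len(x - y) for x in S for y in S))
```

Under the assumption that the set-difference quasi-metric of Example 2.6.4 is $d(A, B) = |A \setminus B|$, this recovers that example together with its co-weight $|A|$.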

The connection of the above result to quasi-metric semigroups was explored
in [165].

Definition 2.6.21. A quasi-metric semigroup is a triple $(X, d, \star)$ such that $(X, d)$
is a quasi-metric space and $(X, \star)$ is a semigroup such that $d$ is $\star$-invariant, that
is, for all $x, y, z \in X$

\[ d(x \star z, y \star z) \leq d(x, y) \quad \text{and} \quad d(z \star x, z \star y) \leq d(x, y). \]

▲

Definition 2.6.22. We call the triple $(X, \preceq, \star)$ an ordered semigroup if $(X, \preceq)$ is
a partial order and $(X, \star)$ a semigroup and for all $x, y, z \in X$,

\[ x \preceq y \implies x \star z \preceq y \star z \quad \text{and} \quad z \star x \preceq z \star y. \]

Furthermore, if $(X, \preceq)$ is a meet semilattice, $(X, \preceq, \star)$ is called an ordered meet
semigroup or just meet semigroup. ▲

It is obvious that a quasi-metric semigroup $(X, d, \star)$ corresponds to an ordered
semigroup $(X, \leq_d, \star)$. Romaguera and Schellekens obtained the following extension
of Theorem 2.6.20.

Theorem 2.6.23 ([165]). Let $(X, \preceq, \star)$ be a meet semigroup, $d$ an invariant weighted
quasi-metric with $\leq_d = \preceq$ and $f$ the corresponding strictly decreasing meet valuation
$f : (X, \preceq) \to (\mathbb{R}_+, \leq)$ as per Theorem 2.6.20. Then $(X, \preceq)$ is a meet
semigroup if and only if for all $x, y, a, b \in X$

\[ f(a \star b \sqcap x \star y) - f(a \star b) \leq f(a \sqcap x) + f(b \sqcap y) - f(a) - f(b). \]

□

We now survey some of the examples from [165] and [170]. More examples
will be provided by the biological sequences.

Example 2.6.24. Recall the Baire partial metric from Example 2.6.14 on the set
$\Sigma^*$ of all finite and infinite sequences of elements of an alphabet $\Sigma$. We also
include $\emptyset$, the empty sequence, in $\Sigma^*$. The corresponding weighted quasi-metric $b$,
given by $b(x, y) = p(x, y) - p(x, x)$, is an invariant meet semilattice quasi-metric.
The corresponding partial order corresponds to prefix ordering: $b(x, y) = 0$ if and
only if $x$ is a prefix of $y$.

Example 2.6.25 ([148, 165]). Denote by $I(\mathbb{R})$ the set of all closed intervals of $\mathbb{R}$
and equip it with a partial metric $p$ defined by

\[ p([a, b], [c, d]) = \max\{b, d\} - \min\{a, c\}. \]

The associated weighted quasi-metric space is a join semilattice with the partial
order being the reverse inclusion.
Example 2.6.26. Consider the dual complexity space $(\mathcal{C}^*, d_{\mathcal{C}^*})$ (Example 2.3.8)
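over the quasi-normed semilinear space $(\mathbb{R}_+, \|\cdot\|_{\mathbb{R}_+})$ where $\|x\|_{\mathbb{R}_+} = x$ (this is a
restriction of the quasi-norm on $\mathbb{R}$ from Example [233]), that is

\[ \mathcal{C}^* = \Bigl\{ f : \mathbb{N} \to \mathbb{R}_+ \;\Big|\; \sum_{n=1}^{\infty} 2^{-n} f(n) < \infty \Bigr\} \quad \text{and} \]

\[ d_{\mathcal{C}^*}(f, g) = \sum_{n=1}^{\infty} 2^{-n} \bigl( (g(n) - f(n)) \vee 0 \bigr) \quad \forall f, g \in \mathcal{C}^*. \]

Then $(\mathcal{C}^*, d_{\mathcal{C}^*})$ is a weighted quasi-metric with the weight being the quasi-norm
on $\mathcal{C}^*$ (i.e. $w(f) = \sum_{n=1}^{\infty} 2^{-n} f(n)$), inducing an invariant meet semilattice.
As it is also a semigroup with respect to the addition, it is an example of a
weightable invariant meet semigroup.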



2.7 Weighted Directed Graphs 

A further important class of examples of quasi-metrics is provided by directed 
graphs. 

Definition 2.7.1. A directed graph, or digraph, is a pair $(V, E)$, where $V$ is a set
of vertices or nodes and $E \subseteq V \times V$ a set of edges.

A weighted directed graph or weighted digraph is a triple $(V, E, \gamma)$ where
$(V, E)$ is a directed graph and $\gamma : E \to \mathbb{R}$ is a function associating a weight
to each edge. ▲

Definition 2.7.2. Let $\Gamma = (V, E)$ be a directed graph and let $u, v \in V$. A (directed)
path connecting $u$ and $v$ is a finite sequence of vertices $v_0, v_1, \ldots, v_n$ such
that $v_0 = u$, $v_n = v$ and for all $i = 1, 2, \ldots, n$, $(v_{i-1}, v_i) \in E$.

For each $u, v \in V$, denote by $\mathcal{P}(u, v)$ the set of all paths connecting $u$ and $v$
and by $\ell(p) = n$ the length of a path $p$.

A (directed) cycle is a path connecting a point with itself.

A directed graph $\Gamma = (V, E)$ is connected if for every pair of vertices $u$ and $v$
there exists a path connecting them. ▲

Remark 2.7.3. A one element sequence $v_0$ is also a path. Indeed, in that case the
condition that for all $1 \leq i \leq n$, $(v_{i-1}, v_i) \in E$, is trivially true. The length of
such a path is obviously 0.

A connected weighted directed graph with positive weights on all edges can be
turned into a quasi-metric space by using the weight of the shortest path between
two vertices as a distance.

Definition 2.7.4. Let $\Gamma = (V, E, \gamma)$ be a connected weighted directed graph and
let $p = p_0, p_1, \ldots, p_{\ell(p)}$ be a path in $\Gamma$. Define the weight of $p$, denoted $\gamma(p)$, by

\[ \gamma(p) = \sum_{i=1}^{\ell(p)} \gamma(p_{i-1}, p_i). \]

If in addition the weight $\gamma(e)$ of any edge $e \in E$ is non-negative, we call the
map $d_\Gamma : V \times V \to \mathbb{R}$, defined by

\[ d_\Gamma(u, v) = \inf_{p \in \mathcal{P}(u, v)} \gamma(p), \]

the path distance on $\Gamma$. ▲

Lemma 2.7.5. Let $\Gamma = (V, E, \gamma)$ be a connected weighted directed graph with
non-negative weights such that for all $u, v \in V$ and for all paths $p$ and $q$ such that
$p \in \mathcal{P}(u, v)$ and $q \in \mathcal{P}(v, u)$,

\[ \gamma(p) = \gamma(q) = 0 \implies u = v. \tag{2.1} \]

Then the path distance $d_\Gamma$ is a quasi-metric on $V$.

Proof. Let $u \in V$. The path $p = u$ has length $\ell(p) = 0$ (cf. Remark 2.7.3) and
the set $\{i \in \mathbb{N} : 1 \leq i \leq \ell(p)\}$ is empty. Since a sum over an empty set must be
0, and $\gamma$ is a non-negative function, we have $d_\Gamma(u, u) = 0$. The separation axiom
follows directly from (2.1). For the triangle inequality, it is sufficient to observe
that for any three points $u, v, w \in V$ and any paths $p \in \mathcal{P}(u, v)$ and $q \in \mathcal{P}(v, w)$,
there exists a path $r \in \mathcal{P}(u, w)$, where $r = p_0, p_1, \ldots, p_{\ell(p)}, q_1, q_2, \ldots, q_{\ell(q)}$, such that
$\gamma(r) = \gamma(p) + \gamma(q)$. □
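
For a finite digraph the path distance of Definition 2.7.4 can be computed with any all-pairs shortest path routine. The sketch below uses the standard Floyd-Warshall recurrence; this is our choice of algorithm for illustration and is not prescribed by the text, and the three-vertex graph is a hypothetical example.

```python
import math


def path_distance(vertices, weights):
    """All-pairs path distance for a digraph with non-negative edge weights.

    `weights` maps an edge (u, v) to its weight; a missing pair means there is
    no such edge. The one-vertex path gives distance 0 from a vertex to itself,
    and pairs with no connecting path are assigned math.inf.
    """
    dist = {(u, v): (0 if u == v else weights.get((u, v), math.inf))
            for u in vertices for v in vertices}
    # Floyd-Warshall: allow paths through one more intermediate vertex k each round.
    for k in vertices:
        for u in vertices:
            for v in vertices:
                if dist[u, k] + dist[k, v] < dist[u, v]:
                    dist[u, v] = dist[u, k] + dist[k, v]
    return dist


# Hypothetical connected digraph with strictly positive weights.
V = ["a", "b", "c"]
gamma = {("a", "b"): 1, ("b", "c"): 2, ("c", "a"): 4, ("a", "c"): 5}
d = path_distance(V, gamma)

print(d["a", "c"], d["c", "a"])   # 3 4
print(d["a", "b"], d["b", "a"])   # 1 6
```

The resulting distance is asymmetric, and since every edge weight is strictly positive no cycle of positive length has zero weight, so condition (2.1) holds and `d` is a quasi-metric on `V`.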
Remark 2.7.6. The condition (2.1) is equivalent to the property that no cycle of
positive length can have a zero weight. 

We call the above quasi-metric on graphs a path quasi-metric. The above construction
is natural and well known (there is a full book devoted to distances in graphs
[28]), especially in the form of the path metric, which is the metric associated to the
path quasi-metric of the above Lemma. It naturally leads to consideration of geometric
properties of digraphs, as in [35]. The converse is also true: every quasi-metric
space can be turned into a weighted directed graph such that the quasi-metric
corresponds to a path quasi-metric.

Lemma 2.7.7. Let $(X, \rho)$ be a quasi-metric space. Then there exists a weighted
directed graph $\Gamma = (V, E, \gamma)$ with non-negative weights such that $d_\Gamma = \rho$.

Proof. Set $V = X$ and $E$ the set of all pairs $(x, y)$ where $x, y \in X$. For any pair
$(x, y) \in E$, set $\gamma(x, y) = \rho(x, y)$ so that $\Gamma = (V, E, \gamma)$ is a weighted directed
graph. It is now straightforward to observe that $d_\Gamma = \rho$. □

We now review other published work connecting quasi-metrics and graphs. 

Jawhari, Misane and Pouzet [101] consider graphs and ordered sets as a kind
of quasi-metric space where the values of the distance function belong to an ordered
semigroup equipped with an involution. In this framework, the graph- or
order-preserving maps are exactly the 'Lipschitz' maps. They generalise various
results on retraction and fixed point property for classical metric spaces to such
spaces.

Deza and Panteleeva [47] introduce polyhedral cones and polytopes associated
with quasi-metrics on finite sets. A cone $C$ generated by a set $X \subseteq \mathbb{R}^n$ is the set
$\{\sum_{x \in X} \lambda_x x \mid \lambda_x \in \mathbb{R}_+ \text{ for all } x \in X\}$. They compute generators and facets of
these polyhedra for small values of $n$ and study their graphs. This paper generalises
some ideas presented in the book by Deza and Laurent [48]. Unfortunately,
analogues of $\ell_1$ embeddability and other interesting issues developed in the book
are not touched.
2.8 Universal Quasi-metric Spaces

Universal metric spaces were introduced by Pavel Urysohn (an alternative spelling
is Uryson) in the 1920s; his paper [191] was published posthumously in 1927.
He showed that there exists a unique universal countable rational metric space $\mathbb{U}^{\mathbb{Q}}$
and that its completion is the universal complete separable metric space $\mathbb{U}$, also
called the Urysohn space. The spaces $\mathbb{U}$ and $\mathbb{U}^{\mathbb{Q}}$ are not only universal in the usual
sense that they contain an isometric copy of every complete separable or countable
rational metric space respectively; they are also ultrahomogeneous, that is, every
isometry between finite subspaces of $\mathbb{U}$ or $\mathbb{U}^{\mathbb{Q}}$ extends to a global isometry.

Urysohn spaces and their groups of isometries have recently received considerable
attention [HH [IM [Ml UM [IM [Ml [Ml [Ml. We construct the universal
countable rational quasi-metric space, which we shall denote $\mathbb{V}^{\mathbb{Q}}$, and the
universal bicomplete separable quasi-metric space $\mathbb{V}$ using a construction similar
to Urysohn's and note that the associated metric spaces are exactly the spaces $\mathbb{U}^{\mathbb{Q}}$
and $\mathbb{U}$ respectively.

Definition 2.8.1. A quasi-metric space $(X, d)$ where the quasi-metric $d$ takes only
rational values is called a rational quasi-metric space. ▲

Definition 2.8.2. Let $\varphi$ be a class of quasi-metric spaces. A quasi-metric space
$V = (V, d_V)$ of class $\varphi$ is called universal or Urysohn if it satisfies the following
properties:

(i) For every quasi-metric space $X = (X, d_X)$ of class $\varphi$ there exists an isometric
embedding $X \hookrightarrow V$; (Universality)

(ii) For every two isometric finite quasi-metric subspaces $F, F'$ of $V$, the isometry
$F \to F'$ extends to a global isometry $V \to V$. (Ultrahomogeneity)

▲

We make use of the following definition.

Definition 2.8.3. Let $X = (X, d_X)$ be a (rational) quasi-metric space, $F$ a finite
quasi-metric subspace of $X$ and $Y = (Y, d_Y)$ a (rational) quasi-metric space such
that $Y = F \cup \{y\}$, a one point quasi-metric extension of $F$. A (rational) quasi-metric
space $W = (W, d_W)$ is called a $U$-extension (respectively $U^{\mathbb{Q}}$-extension)
of $X$ with respect to $F$ and $Y$ if there exists an isometric embedding $X \hookrightarrow W$
and a point $w \in W$ such that the embedding $F \hookrightarrow X$ extends to an isometric
embedding $Y \hookrightarrow W$ sending $y$ to $w$.

A quasi-metric space which is a $U$-extension ($U^{\mathbb{Q}}$-extension) of $X$ with respect
to all finite subsets of $X$ and their one point extensions is called a universal
$U$-extension ($U^{\mathbb{Q}}$-extension) of $X$.

A quasi-metric space which is a $U$-extension ($U^{\mathbb{Q}}$-extension) of all of its finite
subsets is called $U$-universal ($U^{\mathbb{Q}}$-universal). ▲

We now characterise the universal countable rational quasi-metric space as a
countable $U^{\mathbb{Q}}$-universal quasi-metric space and the universal bicomplete separable
quasi-metric space as a bicomplete separable $U$-universal quasi-metric space and
show they are unique up to an isometry. Existence of these spaces is proven in
Subsections 2.8.1 and 2.8.2.

Lemma 2.8.4. Let $U$ and $U'$ be countable $U^{\mathbb{Q}}$-universal quasi-metric spaces and
$F$ and $F'$ finite quasi-metric subspaces of $U$ and $U'$ respectively. Then an isometry
$F \to F'$ extends to a global isometry $U \to U'$.

Proof. We prove the statement using the so-called shuttle or back-and-forth argument.
Let $x_0, x_1, \ldots$ be an enumeration of $U \setminus F$ and $y_0, y_1, \ldots$ an enumeration
of $U' \setminus F'$. Let $X_0 = F$ and $Y_0 = F'$. By our assumption, there exists an isometry
$F \to F'$. Now for each $n \in \mathbb{N}$,

• If $x_n \notin X_n$, set $X'_{n+1} = X_n \cup \{x_n\}$. Clearly $X'_{n+1}$ is finite and by the
$U^{\mathbb{Q}}$-universality of $U'$ there exists $y \in U' \setminus Y_n$ such that the isometric embedding
$X_n \to Y_n$ extends to an isometric embedding $X'_{n+1} \to Y_n \cup \{y\}$. Set
$Y'_{n+1} = Y_n \cup \{y\}$.

If $x_n \in X_n$, set $X'_{n+1} = X_n$ and $Y'_{n+1} = Y_n$.

• If $y_n \notin Y'_{n+1}$, set $Y_{n+1} = Y'_{n+1} \cup \{y_n\}$. By the $U^{\mathbb{Q}}$-universality of $U$, there
exists $x \in U \setminus X'_{n+1}$ such that the isometric embedding $Y'_{n+1} \to X'_{n+1}$
extends to an isometric embedding $Y_{n+1} \to X'_{n+1} \cup \{x\}$. Set
$X_{n+1} = X'_{n+1} \cup \{x\}$.

If $y_n \in Y'_{n+1}$, set $Y_{n+1} = Y'_{n+1}$ and $X_{n+1} = X'_{n+1}$.

It is clear by the recursive construction that for each $n \in \mathbb{N}$, $X_n \subseteq X_{n+1}$,
$Y_n \subseteq Y_{n+1}$, there exists an isometry $X_n \to Y_n$ and for all $m < n$, $x_m \in X_n$ and
$y_m \in Y_n$. It is now sufficient to observe that $U = \bigcup_{n \in \mathbb{N}} X_n$ and $U' = \bigcup_{n \in \mathbb{N}} Y_n$
to establish existence of a global isometry $U \to U'$. □

Lemma 2.8.5. Let $U = (U, d_U)$ be a $U$- ($U^{\mathbb{Q}}$-) universal quasi-metric space,
$X = (X, d_X)$ a countable (rational) quasi-metric space and $F$ a finite subspace
of $X$. Then an isometric embedding $F \hookrightarrow U$ extends to an isometric embedding
$X \hookrightarrow U$.

Proof. Let $x_1, x_2, \ldots$ be an enumeration of $X \setminus F$ and set $F_0 = F$ and
$F_{n+1} = F_n \cup \{x_{n+1}\}$ for all $n \in \mathbb{N}$. By the $U$- (or $U^{\mathbb{Q}}$-) universality of $U$, $F_0 \hookrightarrow U$
extends to an isometric embedding $F_1 = F_0 \cup \{x_1\} \hookrightarrow U$. Assume that for
all $i < k$, an isometric embedding $F_i \hookrightarrow U$ extends to an isometric embedding
$F_{i+1} \hookrightarrow U$. Since $F_{k+1}$ is a finite subset of $X$ and $F_k$ embeds isometrically in
$U$ by our assumption, it follows by the $U$- (or $U^{\mathbb{Q}}$-) universality of $U$ that an
isometric embedding $F_k \hookrightarrow U$ extends to an isometric embedding $F_{k+1} \hookrightarrow U$.
Hence, by induction, for all $i \in \mathbb{N}$, an isometric embedding $F_i \hookrightarrow U$ extends
to an isometric embedding $F_{i+1} \hookrightarrow U$ and therefore there exists an isometric
embedding $X = \bigcup_{i=0}^{\infty} F_i \hookrightarrow U$. □

Proposition 2.8.6. A countable $U^{\mathbb{Q}}$-universal quasi-metric space is the universal
countable rational quasi-metric space. Such a space is unique up to an isometry.

Proof. Universality follows by $U^{\mathbb{Q}}$-universality and Lemma 2.8.5 while ultrahomogeneity
is a consequence of Lemma 2.8.4. Suppose $\mathbb{V}^{\mathbb{Q}}$ and $\mathbb{V}_1^{\mathbb{Q}}$ are two
universal countable rational quasi-metric spaces. Take any finite rational quasi-metric
space $F$. By universality, $F$ embeds isometrically into $\mathbb{V}^{\mathbb{Q}}$ and $\mathbb{V}_1^{\mathbb{Q}}$ and by
Lemma 2.8.4 the isometry between the images of $F$ in $\mathbb{V}^{\mathbb{Q}}$ and $\mathbb{V}_1^{\mathbb{Q}}$ extends to a
global isometry. Hence any two universal countable rational quasi-metric spaces
are isometric. □

Remark 2.8.7. In fact, $U^{\mathbb{Q}}$-universality is equivalent to universality for a countable
rational quasi-metric space since obviously universality implies $U^{\mathbb{Q}}$-universality.

Proposition 2.8.8. A bicomplete separable $U$-universal quasi-metric space is the
universal bicomplete separable quasi-metric space. Such a space is unique up to an
isometry.

Proof. Let $X$ be a bicomplete separable $U$-universal quasi-metric space. Every
bicomplete separable quasi-metric space $Y$ contains a countable dense subset $Y'$
which, by Lemma 2.8.5, embeds into a dense subspace of a $U$-universal space.
This embedding obviously extends to all Cauchy (with respect to the associated
metric) sequences of points in $Y'$ whose limits are all in $X$. Therefore, $X$ satisfies
universality. On the other hand, Lemma 2.8.4 can be used to extend the
isometric embedding $F' \hookrightarrow X$ of any finite subset of a countable dense subset $Y'$
of $Y$ to the isometric embedding $Y' \hookrightarrow X$ which can then be extended to a global
embedding since $Y$ and $X$ are bicomplete.

Lemma 2.8.4 also implies uniqueness. Suppose $\mathbb{V}$ and $\mathbb{V}_1$ are two universal
bicomplete separable quasi-metric spaces. Any finite rational quasi-metric
space $F$ embeds isometrically into $\mathbb{V}$ and $\mathbb{V}_1$ by universality and by Lemma
2.8.4 the isometry between the images of $F$ in $\mathbb{V}$ and $\mathbb{V}_1$ extends to a global isometry
between countable dense subsets of $\mathbb{V}$ and $\mathbb{V}_1$. Since $\mathbb{V}$ and $\mathbb{V}_1$ are bicomplete,
such an isometry extends to an isometry $\mathbb{V} \to \mathbb{V}_1$. □

Remark 2.8.9. The metric space associated to a universal quasi-metric space is
also universal since every isometry between quasi-metric spaces is an isometry
between their associated metric spaces (Lemma 2.1.8). Therefore, $(\mathbb{V}^{\mathbb{Q}})^s = \mathbb{U}^{\mathbb{Q}}$
and $\mathbb{V}^s = \mathbb{U}$.

2.8.1 Universal countable rational quasi-metric space 

Lemma 2.8.10. Let $X = (X, d_X)$ be a quasi-metric space and $F$ a finite quasi-metric
subspace of $X$. Let $Y = (Y, d_Y)$, where $Y = F \cup \{y\}$, be a (rational)
quasi-metric space containing $F$ as a quasi-metric subspace plus an extra point
$\{y\}$. Then, there exists a $U$-extension of $X$ with respect to $F$ and $Y$. If $X$ and
$Y$ are rational quasi-metric spaces, there exists a $U^{\mathbb{Q}}$-extension of $X$ with respect
to $F$ and $Y$.

Proof. Let $X$, $F$ and $Y$ be as above and $\Gamma_X = (X, E, \gamma)$ the weighted directed
graph from Lemma 2.7.7 such that the path quasi-metric on $\Gamma_X$ coincides with
$d_X$. Add another point to $\Gamma_X$, that is, let $\Gamma_W = (W, E', \gamma')$ be a weighted directed
graph such that $W = X \cup \{w\}$, $E' = E \cup \{(x, w) \mid x \in F\} \cup \{(w, x) \mid x \in F\}$
and

\[ \gamma'(u, v) = \begin{cases} \gamma(u, v) & \text{if } u \in X \text{ and } v \in X, \\ d_Y(u, w) & \text{if } u \in F \text{ and } v = w, \\ d_Y(w, v) & \text{if } u = w \text{ and } v \in F. \end{cases} \tag{2.2} \]

It is clear that $\Gamma_W$ is connected and hence the path quasi-metric $d_{\Gamma_W}$ is
well-defined (Lemma 2.7.5). Let $d_W = d_{\Gamma_W}$ and $Y' = F \cup \{w\}$. To complete the
proof we verify that $d_W|_F = d_X|_F$ and $d_W|_{Y'} = d_Y$. Let $u, v \in W$. Denote by
$\mathcal{P}(u, v)$ the set of all paths in $W$ linking $u$ and $v$.

Since $F$ embeds isometrically in $X$, and $X$ embeds isometrically in $W$, it is
clear that $d_W|_F \leq d_X|_F$. Let $u, v \in F$ and suppose that there exists a path
$p \in \mathcal{P}(u, v)$ such that $d_W(u, v) = \gamma'(p) < d_X(u, v)$. Then $p$ must pass through
$w$, implying that $d_W(u, v) = d_W(u, w) + d_W(w, v) = d_Y(u, w) + d_Y(w, v) \geq
d_Y(u, v)$ by the triangle inequality. As $Y$ is an extension of $F$, we have $d_Y(u, v) =
d_X(u, v)$, implying $d_W(u, v) \geq d_X(u, v)$ and contradicting our premise. Therefore,
$d_W|_F = d_X|_F = d_Y|_F$.

Let $u \in F$. It is clear from Equation 2.2 that $d_W(u, w) \leq d_Y(u, w)$
and $d_W(w, u) \leq d_Y(w, u)$. Suppose there exists a path $p \in \mathcal{P}(u, w)$ such that
$d_W(u, w) = \gamma'(p) < d_Y(u, w)$. As there is no edge $(x, w)$ in $E'$ for any $x \in X \setminus F$,
such $p$ cannot pass through any point $x \in X \setminus F$, nor can it pass through $w$
except as a last point. On the other hand, for any $v \in F$, $d_W(u, v) + d_W(v, w) =
d_Y(u, v) + d_Y(v, w) \geq d_Y(u, w)$ by the triangle inequality. This contradicts our
supposition and hence $d_W(u, w) = d_Y(u, w)$. In the same way it can be shown
that $d_W(w, u) = d_Y(w, u)$ and therefore $d_W|_{Y'} = d_Y$.

It is obvious that $(W, d_W)$ is a rational quasi-metric space if $d_X$ and $d_Y$ take
values in $\mathbb{Q}$. □

Denote by $W(X, (F, Y))$ the $U$- (or $U^{\mathbb{Q}}$-) extension of $X$ with respect to $F$
and $Y$ constructed in Lemma 2.8.10.

Lemma 2.8.11. Let (X, dx) be a countable rational quasi-metric space. Then 
there exists a countable U'^-universal extension ofX. 

Proof. Let ^(X) be the set of all pairs (F, Y) where F is a finite subspace of 
X and F is a rational quasi-metric space F = F U {y} containing F as a quasi- 
metric subspace plus an extra point {y}. Since X is countable and dx takes values 
in Q, ^(X) is countable. Let Nq, Xi, ... be an enumeration of We now 

construct the required space recursively. 

Let Zq = W{X, No) and Zi+i = W{Zi, Ni+i) for all « G N. We claim that 
for each i E N, X C Zi and Zi is a U'^ extension of X with respect to Ni. Lideed, 
X C Zq and Zq is a U'^ extension of X with respect to Nq. Assuming for all G N 
that X C Zk and denoting Nk+i = (F', Y'), it follows that F' is a finite subset of 
Zfc and hence Z^+i is well-defined. By the Lemma 12.8.101 X C Z^ C Z^^i and 
Zi is a U"^ extension of X with respect to Xfc+i- Our claim therefore follows by 
induction and the union Ujgn '^he required countable f/^-universal extension 



Denote by Z{X) the f/^-universal extension of a rational quasi-metric space 
constructed in the Lemma [2.8.1 1[ 

Corollary 2.8.12. There exists a countable U'^ -universal quasi-metric space Y'^. 

Proof. We again employ recursion. Set f/o = {*}, a one-point quasi-metric space, 
Un+i = Z{Un) for alH G N and U = IJneN ^n- claim that for every finite 
rational quasi-metric space F = (F, dp) of cardinality n>l 

(i) there exists an isometric embedding F ^ Un-i, and 

(ii) Un is a t/^-universal extension of F. 



values in rationals. 



□ 



ofX. 



□ 
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It is clear by the above construction that this is indeed the case for the one-point 
quasi-metric space. Assume our claim holds for some k E N and let F' be a 
finite quasi-metric space of cardinality k + 1. Let F" be a A;-point restriction of 
F'. By our claim (ii), f/„ is a ^/''^-universal extension of F" and hence contains an 
isometric copy of F'. By the Lemma 12.8.1 11 Uk+i is a f/*'^ -universal extension of 
F' and we have proven our claim by induction. Each of sets Un is countable and 
therefore V = is a countable ^/''^-universal quasi-metric space. □ 
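The one-point extension step of Lemma 2.8.10 can be tried out on a small finite example. The following Python sketch is an illustration only: it assumes the quasi-metric on X is given as a complete table of directed distances, computes the path quasi-metric with the Floyd-Warshall algorithm (a choice made here for brevity, not something prescribed by the text), and uses hypothetical names (`path_quasi_metric`, `one_point_extension`) and a hypothetical three-point space.

```python
from itertools import product

INF = float("inf")


def path_quasi_metric(points, weight):
    """All-pairs shortest directed path weights (Floyd-Warshall)."""
    d = {(u, v): (0 if u == v else weight.get((u, v), INF))
         for u, v in product(points, repeat=2)}
    for k in points:
        for i in points:
            for j in points:
                if d[i, k] + d[k, j] < d[i, j]:
                    d[i, j] = d[i, k] + d[k, j]
    return d


def one_point_extension(X, d_X, F, d_Y, y, w):
    """Add a point w joined only to F, with edge weights copied from d_Y."""
    weight = {(u, v): d_X[u, v] for u, v in product(X, repeat=2) if u != v}
    for f in F:
        weight[f, w] = d_Y[f, y]
        weight[w, f] = d_Y[y, f]
    return path_quasi_metric(list(X) + [w], weight)


# Hypothetical rational quasi-metric space X = {p, q, r} and extension Y of F = {p, q}.
X = ["p", "q", "r"]
d_X = {("p", "q"): 1, ("q", "p"): 2, ("p", "r"): 2,
       ("r", "p"): 1, ("q", "r"): 1, ("r", "q"): 2}
F = ["p", "q"]
d_Y = {("p", "y"): 2, ("y", "p"): 1, ("q", "y"): 1, ("y", "q"): 2}

d_W = one_point_extension(X, d_X, F, d_Y, "y", "w")
```

In this example the distances inside X and the distances between F and the new point are unchanged, while the distances between `r` and `w` are realised by paths through F.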

2.8.2 Universal bicomplete separable quasi-metric space 

To show that the bicompletion of the universal countable rational quasi-metric space is the universal bicomplete separable quasi-metric space we extend the argument of Gromov ([79], pp. 80-81) for the universal metric spaces.

Lemma 2.8.13. Let $X = (X, d_X)$ be a quasi-metric space admitting an everywhere dense $U^{\mathbb{Q}}$-universal quasi-metric subspace $Z = (Z, d_Z)$. Then for each finite subset $F \subseteq X$, every $\delta > 0$ and any one point quasi-metric extension $(Y, d_Y)$ of $F$, where $Y = F \cup \{y\}$, there exists $x \in X$ such that for all $f \in F$

$$|d_X(x, f) - d_Y(y, f)| < \delta$$

and

$$|d_X(f, x) - d_Y(f, y)| < \delta.$$

Proof. Let $X$, $Y$, $Z$ and $F = \{f_1, f_2, \ldots, f_n\}$ be as above and let $\delta > 0$ and $\varepsilon = \delta/4$. Since $Z$ is everywhere dense in $X$ we can approximate $F$ by the set $F' = \{f'_1, \ldots, f'_n\} \subset Z$ such that for all $i = 1, 2, \ldots, n$, $d_X(f_i, f'_i) < \varepsilon$ and $d_X(f'_i, f_i) < \varepsilon$. Let $\Gamma_{F'} = (F', E, \gamma)$ be the weighted directed graph from Lemma 2.7.7 such that the path quasi-metric on $\Gamma_{F'}$ coincides with $d_X|F'$. Construct a one point extension $\Gamma_{Y'} = (Y', E', \gamma')$ such that $Y' = F' \cup \{y'\}$ and $E' = E \cup \{(y', f'_i), (f'_i, y') \mid i = 1, 2, \ldots, n\} \cup \{(y', y')\}$. Set $\gamma'(y', y') = 0$ and for each $i$, let $\gamma'(y', f'_i)$ be any rational such that

$$d_Y(y, f_i) - \varepsilon < \gamma'(y', f'_i) < d_Y(y, f_i) + \varepsilon,$$

and $\gamma'(f'_i, y')$ a rational such that

$$d_Y(f_i, y) - \varepsilon < \gamma'(f'_i, y') < d_Y(f_i, y) + \varepsilon.$$

By Lemma 2.7.5, $Y' = (Y', d_{\Gamma_{Y'}})$ forms a rational quasi-metric space which is a one point extension of $F' \subset Z$. By the $U^{\mathbb{Q}}$-universality of $Z$, there exists $x \in Z$ such that for each $i = 1, 2, \ldots, n$, $d_X(x, f'_i) = d_Z(x, f'_i) = d_{\Gamma_{Y'}}(y', f'_i)$ and $d_X(f'_i, x) = d_Z(f'_i, x) = d_{\Gamma_{Y'}}(f'_i, y')$. It remains to verify the required inequalities.

Clearly, for each $i$, $d_{\Gamma_{Y'}}(y', f'_i) \le \gamma'(y', f'_i)$ and hence

$$\begin{aligned} d_X(x, f_i) &\le d_X(x, f'_i) + d_X(f'_i, f_i) \\ &< d_{\Gamma_{Y'}}(y', f'_i) + \varepsilon \\ &\le \gamma'(y', f'_i) + \varepsilon \\ &< d_Y(y, f_i) + 2\varepsilon. \end{aligned}$$

On the other hand, since $d_{\Gamma_{Y'}}$ is a path quasi-metric, there exists $1 \le j \le n$ such that $d_{\Gamma_{Y'}}(y', f'_i) = \gamma'(y', f'_j) + d_X(f'_j, f'_i)$ (this includes the case $j = i$) and therefore

$$\begin{aligned} d_X(x, f_i) &\ge d_X(x, f'_i) - d_X(f_i, f'_i) \\ &> d_{\Gamma_{Y'}}(y', f'_i) - \varepsilon \\ &= \gamma'(y', f'_j) + d_X(f'_j, f'_i) - \varepsilon \\ &> d_Y(y, f_j) + d_X(f_j, f_i) - d_X(f'_i, f_i) - d_X(f_j, f'_j) - 2\varepsilon \\ &> d_Y(y, f_j) + d_Y(f_j, f_i) - 4\varepsilon \\ &\ge d_Y(y, f_i) - 4\varepsilon. \end{aligned}$$

Thus, for all $f \in F$, $|d_X(x, f) - d_Y(y, f)| < 4\varepsilon = \delta$. The other inequality is verified in the same way. □

Lemma 2.8.14. Let $X = (X, d_X)$ be a bicomplete quasi-metric space admitting an everywhere dense $U^{\mathbb{Q}}$-universal quasi-metric subspace. Then $X$ is a $U$-universal quasi-metric space.

Proof. Let $X$ be as above, $F$ a finite subset of $X$ and $(F \cup \{y\}, d_Y)$ a one-point quasi-metric extension of $F$. We must show that there exists a point $x \in X$ such that for each $f \in F$, $d_X(x, f) = d_Y(y, f)$ and $d_X(f, x) = d_Y(f, y)$.

Assume without loss of generality that for all $f \in F$, $d_Y^s(y, f) \ge \delta > 0$, that is, one of the distances $d_Y(y, f)$ and $d_Y(f, y)$ is bounded below by $\delta$ while the other can be 0. We find by induction a sequence of points $x_1, x_2, \ldots, x_i, \ldots \in X$ such that for all $f \in F$ and all $i = 1, 2, \ldots$

(i) $|d_X(f, x_i) - d_Y(f, y)| < \delta 2^{-i}$,

(ii) $|d_X(x_i, f) - d_Y(y, f)| < \delta 2^{-i}$,

(iii) $d_X^s(x_j, x_{j+1}) < \delta 2^{-j+1}$ for all $j = 1, 2, \ldots, i - 1$, and

(iv) $\min\{d_X(f, x_i), d_X(x_i, f)\} > 3\delta 2^{-i}$.

Indeed, assume such elements $x_i$ exist for all $i = 1, 2, \ldots, k$. Let $F_k = F \cup \{x_1, x_2, \ldots, x_k\}$ and $Y' = F_k \cup \{y'\}$, a one point extension of $F_k$. We claim there exists a quasi-metric $d_{Y'}$ on $Y'$ satisfying

(a) $d_{Y'}|F_k = d_X|F_k$,

(b) $d_{Y'}(f, y') = d_Y(f, y)$,

(c) $d_{Y'}(y', f) = d_Y(y, f)$, and

(d) $d_{Y'}(y', x_k) = d_{Y'}(x_k, y') = \delta 2^{-k}$.

It is clear that the condition (a) defines a quasi-metric on $F_k$. We will show that the conditions (a), (b), (c) and (d) together also define a quasi-metric $d_{F'}$ on $F' = F \cup \{x_k, y'\}$.

Denote by $\Delta(u, v, w)$ the triangle inequality $d_{F'}(u, w) \le d_{F'}(u, v) + d_{F'}(v, w)$ for some points $u, v, w \in F'$. The inequalities $\Delta(y', f_1, f_2)$, $\Delta(f_1, y', f_2)$ and $\Delta(f_1, f_2, y')$ where $f_1, f_2 \in F$ follow from our assumption of $Y$ being a quasi-metric space, while the inequalities $\Delta(y', x_k, f)$, $\Delta(f, y', x_k)$, $\Delta(x_k, y', f)$ and $\Delta(f, x_k, y')$ where $f \in F$ clearly follow by (i) and (ii). The remaining two inequalities, $\Delta(y', f, x_k)$ and $\Delta(x_k, f, y')$, follow directly from (iv) (we have $d_{F'}(f, x_k) > 3\delta 2^{-k} > \delta 2^{-k} = d_{F'}(y', x_k)$ and $d_{F'}(x_k, f) > 3\delta 2^{-k} > \delta 2^{-k} = d_{F'}(x_k, y')$).

Therefore, $d_{F'}$ is a quasi-metric on $F' = F \cup \{x_k, y'\}$ agreeing with the induced quasi-metric on $F_k = F \cup \{x_1, x_2, \ldots, x_k\}$ on the intersection $F_k \cap F' = F \cup \{x_k\}$. Hence, there exists a quasi-metric on the union $Y' = F_k \cup F'$ satisfying the properties (a) - (d) (this is easily shown by taking the distance between any two points not in the intersection to be the shortest path through the intersection).

By Lemma 2.8.13, there exists a point $x_{k+1} \in X$ such that for each $f' \in F_k$,

$$|d_X(x_{k+1}, f') - d_{Y'}(y', f')| < \delta 2^{-k-1}$$

and

$$|d_X(f', x_{k+1}) - d_{Y'}(f', y')| < \delta 2^{-k-1}$$

and thus, by (b) and (c), it follows that for all $f \in F$,

$$|d_X(x_{k+1}, f) - d_Y(y, f)| < \delta 2^{-(k+1)}$$

and

$$|d_X(f, x_{k+1}) - d_Y(f, y)| < \delta 2^{-(k+1)}.$$

Furthermore, by (d),

$$d_X(x_{k+1}, x_k) < \delta 2^{-k-1} + d_{Y'}(y', x_k) < \delta 2^{-k+1}$$

and

$$d_X(x_k, x_{k+1}) < \delta 2^{-k-1} + d_{Y'}(x_k, y') < \delta 2^{-k+1},$$

implying $d_X^s(x_k, x_{k+1}) < \delta 2^{-k+1}$. Finally, for all $f \in F$,

$$\begin{aligned} d_X(f, x_{k+1}) &\ge d_{Y'}(f, y') - \delta 2^{-k-1} \\ &\ge d_X(f, x_k) - d_{Y'}(y', x_k) - \delta 2^{-k-1} \\ &> 3\delta 2^{-(k+1)}. \end{aligned}$$

Similarly, $d_X(x_{k+1}, f) > 3\delta 2^{-(k+1)}$.

We conclude by induction that there exists an infinite sequence $x_1, x_2, \ldots$ satisfying (i) - (iv). By (iii), this sequence is $d_X^s$-Cauchy and hence convergent since $X$ is bicomplete. It converges to the required $x$ by (i) and (ii). □

Corollary 2.8.15. There exists a $U$-universal bicomplete separable quasi-metric space $V$.

Proof. The required space is $V = \widetilde{V^{\mathbb{Q}}}$, the bicompletion of the universal countable rational quasi-metric space $V^{\mathbb{Q}}$. □

Chapter 3

Sequences and Similarities 



Pairwise sequence comparison is undoubtedly one of the core areas of bioinformatics. The best known tool (actually a set of tools) is NCBI BLAST (Basic Local Alignment Search Tool) which, given a DNA or protein sequence of interest, retrieves all similar sequences from a sequence database. The similarity measure according to which sequences are compared is obtained by extending a similarity measure on the set of nucleotides (in the case of DNA) or the set of amino acids (in the case of proteins) to DNA or protein sequences, using a procedure known as alignment. Two types of (pairwise) alignments are usually distinguished: global, between whole sequences, and local, between fragments of sequences. Similarity scores on nucleotides or amino acids, as well as the penalties for 'gaps' introduced into sequences while aligning them, usually have a statistical interpretation.

The objective of this chapter is to establish the link between similarity measures on biological sequences and quasi-metrics. While the connections of global similarities to (quasi-) metrics have long been known [178], the novel result is that local similarities can also be converted to quasi-metrics while preserving the neighbourhood structure. The assumptions required for such a conversion are satisfied by the similarity measures most widely used for searching DNA and protein databases. We develop this result in the context of free semigroups, which correspond to sets of strings from a finite alphabet, and use the string and semigroup terminology interchangeably. The use of semigroup terminology may point to generalisations and extensions of our results to other areas.

3.1 Free semigroups and monoids 

Recall that the free monoid on a nonempty set $\Sigma$, denoted $\Sigma^*$, is the monoid whose elements, called words or strings, are all finite sequences of zero or more elements from $\Sigma$, with the binary operation of concatenation. The unique sequence of zero letters (empty string), which we shall denote $e$, is the identity element. The free semigroup on $\Sigma$, denoted $\Sigma^+$, is the subset of $\Sigma^*$ containing all elements except the identity.

The length of a word $w \in \Sigma^*$, denoted $|w|$, is the number of occurrences of members of $\Sigma$ in it. For $w = \sigma_1\sigma_2 \cdots \sigma_n$ where $\sigma_i \in \Sigma$, $|w| = n$, and we set $|e| = 0$.

For two words $u, v \in \Sigma^*$, $u$ is a factor or substring of $v$ if $v = xuy$ for some $x, y \in \Sigma^*$; $u$ is a prefix of $v$ if $v = uw$ for some $w \in \Sigma^*$; $u$ is a suffix of $v$ if $v = wu$ for some $w \in \Sigma^*$; $u$ is a subsequence or subword of $v$ if $v = w_1^* u_1^* w_2^* u_2^* \cdots w_n^* u_n^* w_{n+1}^*$ where $u = u_1^* u_2^* \cdots u_n^*$, $u_i^* \in \Sigma^*$ and $w_i^* \in \Sigma^*$. For any $x \in \Sigma^*$, we use $\mathcal{F}(x)$ to denote the set of all factors of $x$.

We call a semigroup (monoid) $(X, \star)$ free if it is isomorphic to the free semigroup (monoid) on some set $\Sigma$. The unique set of elements of $X$ mapping to $\Sigma$ under the isomorphism is called the set of free generators.

As a convention, for any word $u \in \Sigma^*$, the notation $u = u_1 u_2 \cdots u_n$, where $n = |u|$, shall mean that $u_i \in \Sigma$, while the notation $u = u_1^* u_2^* \cdots u_n^*$ shall imply that $u_i^* \in \Sigma^*$. For all $1 \le k \le |u|$ we shall use $\bar{u}_k$ to denote the word $u_1 u_2 \cdots u_k$ and set $\bar{u}_0 = e$.

The motivating examples of free semigroups for this chapter are biological sequences and structures related to them. It is quite natural that those macromolecules which are linear polymers of a limited number of small molecules and whose properties strongly depend on the sequence of their constituent building blocks can be represented in this way. For example, a DNA molecule can be represented as a word in the free semigroup generated by the four-letter nucleotide alphabet $\Sigma = \{A, T, C, G\}$ while an RNA molecule is a word in the free semigroup generated by the alphabet $\Sigma = \{A, U, C, G\}$. A protein can be thought of as a word in the free semigroup generated by the amino acid alphabet (Table 1.1).

A further example from biological sequence analysis is provided by profiles [78, 218]. Let $\Sigma$ be a set and denote by $\mathcal{M}(\Sigma)$ the set of all probability measures supported on $\Sigma$. We shall call the elements of the free monoid $\mathcal{M}(\Sigma)^*$ profiles over $\Sigma^*$. Profiles arise as models of sets of structurally related biological sequences where $\Sigma$ is the DNA or protein alphabet.

3.2 Generalised Hamming Distance 

The simplest way to extend a distance from generators to words of equal length is to use what we call a generalised Hamming distance, a special case of the $\ell_1$-type sum mentioned in Example 2.2.16.

Definition 3.2.1. Let $\Sigma$ be a set and let $\Sigma^n = \{w \in \Sigma^+ : |w| = n\}$, the set of words of length $n$ in the free semigroup generated by $\Sigma$. Let $d_\Sigma : \Sigma \times \Sigma \to \mathbb{R}$ be a distance on $\Sigma$. The generalised Hamming distance on $\Sigma^n$ is a function $d : \Sigma^n \times \Sigma^n \to \mathbb{R}$ where

$$d(u, v) = \sum_{i=1}^{n} d_\Sigma(u_i, v_i).$$

▲

As mentioned in Example 2.2.17, the Hamming distance is a special case where $d_\Sigma$ is the discrete metric. If the distance on the set of generators $\Sigma$ is a quasi-metric, the same holds for the generalised Hamming distance on $\Sigma^n$ (Example 2.2.16). Obviously, similarity measures on the generators can be extended in the same way.

The generalised Hamming distance has the advantage that it can be computed in linear time. It can be interpreted as the total cost of substitutions necessary to transform one word into another. It is worth noting that it is permutation invariant: permuting both words with the same permutation does not change their distance.

The main practical disadvantage of the generalised Hamming distance is that it is restricted to words of the same length and that it does not consider any type of transformation other than substitution. Hence it is only suitable for modelling sets of words of the same length where insertions or deletions of factors (i.e. single characters or segments) are unlikely.
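As a minimal sketch of Definition 3.2.1, the following Python function sums a letter-wise distance over two words of equal length. The function names and the asymmetric letter cost `toy_quasi_metric` are hypothetical and chosen only to show that an asymmetric distance on the generators yields an asymmetric distance on words.

```python
def generalised_hamming(u, v, d_sigma):
    """Sum of letter-wise distances d_sigma(u_i, v_i) for words of equal length."""
    if len(u) != len(v):
        raise ValueError("generalised Hamming distance needs words of equal length")
    return sum(d_sigma(a, b) for a, b in zip(u, v))


def discrete(a, b):
    """Discrete metric: gives the ordinary Hamming distance."""
    return 0 if a == b else 1


# Hypothetical asymmetric costs on {A, C, G, T}; every unlisted substitution costs 3.
_COST = {("A", "G"): 1, ("G", "A"): 2}


def toy_quasi_metric(a, b):
    """Illustrative asymmetric letter distance satisfying the triangle inequality."""
    if a == b:
        return 0
    return _COST.get((a, b), 3)
```

With these assumptions, `generalised_hamming("AA", "GG", toy_quasi_metric)` and `generalised_hamming("GG", "AA", toy_quasi_metric)` differ, which reflects the asymmetry of the letter costs.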

3.3 String Edit Distances 

The term string edit distances shall be used to refer to all distances between words defined as the smallest weight of a sequence of permitted weighted transformations transforming one word into another. In a stricter sense, the string edit distance denotes the smallest number of permitted edit operations required to transform one string into another, where the permitted edit operations are substitutions of one character for another, insertions of one character into the first string and deletions of one character from the first string. It was first mentioned in the paper by V. Levenstein [122] and is often referred to as the Levenstein distance. In their 1976 paper [203], Waterman, Smith and Beyer introduced the most general form of the string edit distance and proposed an algorithm to compute it in some important cases. Below, we outline their construction of the so-called $\tau$-(quasi-)metric, which we shall refer to as the W-S-B distance.

3.3.1 W-S-B distance 

Definition 3.3.1. Let $\Sigma$ be a set and $\Sigma^*$ a free monoid over $\Sigma$ with the identity element $e$. Suppose $\tau = \{T : \mathcal{D}(T) \to \Sigma^* \mid \mathcal{D}(T) \subseteq \Sigma^*\}$ is a finite set of transformations defined on subsets of $\Sigma^*$ such that the identity transformation $I$ is in $\tau$. Let $w : \tau \to \mathbb{R}_+$ be a function such that $w(T) = 0 \iff T = I$. We call the pair $(\tau, w)$ a set of weighted edit operations on $\Sigma^*$. ▲

Definition 3.3.2. Let $\Sigma$ be a set and $(\tau, w)$ a (finite) set of weighted edit operations on $\Sigma^*$. Let $u = u_1 u_2 \cdots u_n \in \Sigma^*$, where $u_i \in \Sigma$, and let $T \in \tau$. Fix $1 \le j \le n$ and suppose $u_j u_{j+1} \cdots u_n \in \mathcal{D}(T)$. Then $T^j$ is defined by

$$T^j(u) = u_1 u_2 \cdots u_{j-1} T(u_j u_{j+1} \cdots u_n).$$

If $e \in \mathcal{D}(T)$, then $T^{n+1}$ is defined by $T^{n+1}(u) = uT(e)$.

For any $u, v \in \Sigma^*$ define

$$\{u \to v\}_\tau = \{T_{i_m}^{j_m}, T_{i_{m-1}}^{j_{m-1}}, \ldots, T_{i_1}^{j_1} : T_{i_m}^{j_m} T_{i_{m-1}}^{j_{m-1}} \cdots T_{i_1}^{j_1}(u) = v\},$$

where $T_{i_k} \in \tau$, that is, $\{u \to v\}_\tau$ is the set of all finite sequences of transformations from $\tau$ such that the ordered composition of such transformations maps $u$ into $v$. The members of $\{u \to v\}_\tau$ are called edit scripts. Also, if $\{u \to v\}_\tau \ne \emptyset$, for any $\zeta = T_{i_m}^{j_m}, T_{i_{m-1}}^{j_{m-1}}, \ldots, T_{i_1}^{j_1} \in \{u \to v\}_\tau$ define

$$w(\zeta) = \sum_{k=1}^{m} w(T_{i_k}).$$

▲

Remark 3.3.3. In theory, $\tau$ can be allowed to be an infinite set. In that case, the minimum in Definition 3.3.4 of the $\tau$-distance below must be replaced by an infimum and many proofs become very awkward. So far there have been no interesting examples involving infinite sets of transformations.

Definition 3.3.4. Let $\Sigma$ be a set and $(\tau, w)$ a (finite) set of weighted edit operations on $\Sigma^*$. For any $u, v \in \Sigma^*$, define the $\tau$-distance $\rho_{\tau,w}$ on $\Sigma^* \times \Sigma^*$ by

$$\rho_{\tau,w}(u, v) = \min_{\zeta \in \{u \to v\}_\tau} w(\zeta),$$

if $\{u \to v\}_\tau \ne \emptyset$ and $\rho_{\tau,w}(u, v) = \infty$ if $\{u \to v\}_\tau = \emptyset$. ▲

Hence, the $\tau$-distance between two words is the smallest weight of an edit script of operations in $\tau$ transforming (in the sense of ordered composition) one word into another.

The relation $\rho_{\tau,w}(u, v) < \infty$ is an equivalence relation and partitions $\Sigma^*$ into equivalence classes $\{\Sigma^*_\alpha\}$ where the value of $\rho_{\tau,w}$ between any two members of $\Sigma^*_\alpha$ is finite. We have the following simple fact:

Theorem 3.3.5 ([203]). Let $\Sigma$ be a set and $(\tau, w)$ a set of weighted edit operations on $\Sigma^*$. For each equivalence class $\Sigma^*_\alpha$ of $\Sigma^*$, $\rho_{\tau,w}|\Sigma^*_\alpha$ is a quasi-metric. □

The $\tau$-metric is defined on each $\Sigma^*_\alpha$ as the associated metric $\rho^s_{\tau,w}$. Note that the requirement that $w(T) > 0$ for each $T \in \tau$ such that $T \ne I$ implies that $\rho_{\tau,w}$ is a $T_1$-quasi-metric.

Remark 3.3.6. It is easy to observe that the $\tau$-quasi-metric is equivalent to the path quasi-metric on the connected components of a weighted directed multigraph (two vertices can be joined by more than one directed edge) where the vertices are words in $\Sigma^*$ and two words $u$ and $v$ are joined with an edge if there is a transformation $T \in \tau$ such that for some $j$, $T^j(u) = v$. The weight of each edge is the weight of the corresponding transformation and an edit script is a path in the multigraph. Section 2.7 presents the development of the path quasi-metric on a weighted directed graph and the same technique can be trivially extended to multigraphs.

We now present the terminology and notation for the most biologically rele- 
vant sets of weighted edit operations. 

Definition 3.3.7. Let $\Sigma$ be a set and $\Sigma^*$ a free monoid over $\Sigma$ with the identity element $e$. Define the following transformations of elements of $\Sigma^*$:

• $T_{u-} : uv \mapsto v$, where $u \in \Sigma^+$, $v \in \Sigma^*$,

• $T_{u+} : v \mapsto uv$, where $u \in \Sigma^+$, $v \in \Sigma^*$, and

• $T_{(a,b)} : au \mapsto bu$, where $a, b \in \Sigma$ and $u \in \Sigma^*$.

The transformations of the type $T_{(a,b)}$ are called substitutions or mutations, of the type $T_{u+}$ are called insertions and of the type $T_{u-}$ are called deletions. Insertions and deletions are collectively called indels.

Define

$$\tau_0 = \{T_{a-} : a \in \Sigma\} \cup \{T_{a+} : a \in \Sigma\} \cup \{T_{(a,b)} : a, b \in \Sigma\}$$

and

$$\tau_\lambda = \{T_{u-} : u \in \Sigma^+\} \cup \{T_{u+} : u \in \Sigma^+\} \cup \{T_{(a,b)} : a, b \in \Sigma\}.$$

▲

Note that $\tau_0$ and $\tau_\lambda$ implicitly contain the identity transformation $I = T_{(a,a)}$ for any $a \in \Sigma$.

Example 3.3.8. For a set of letters $\Sigma$, the Levenstein distance is realised as $\rho_{\tau_0,w}$ where $w(T) = 1$ for all $T \in \tau_0$ such that $T \ne I$.

While providing an easily interpretable example, the Levenstein distance is 
too simplistic for comparison of biological sequences and more general distances 
must be used. From an evolutionary point of view, each transformation should 
correspond to a mutational event and the resulting distance to the 'evolutionary 
distance' between two sequences. In practice, not all transformations of biological 
sequences are equally likely. For example, substitutions are generally more likely 
than indels, while some substitutions may be more likely than others. This is certainly the case in proteins, where one observes, for example, that substitutions of I for V are more common than substitutions of I for K. It was also argued [178] that indels are more likely to take place by segments than character-by-character and hence that indels of arbitrary segments should take weights smaller than the sum of the weights of indels of single characters comprising each segment.

Example 3.3.9. The Sellers (or s-) distance, introduced by Sellers in 1974 [171], is a metric obtained by extension of a metric $\rho$ on the set $\Sigma_e = \Sigma \cup \{e\}$, the set of generators plus the identity element, to the free monoid $\Sigma^*$. The value of $\rho(\sigma, \tau)$ for $\sigma, \tau \in \Sigma$ represents the cost of substitution of $\sigma$ for $\tau$ in a word in $\Sigma^+$ while $\rho(\sigma, e)$ is the cost of insertion or deletion of a character $\sigma$.

The s-metric can be considered as a special case of the W-S-B metric by using $\tau_0$ as the set of transformations. Suppose $w(T_{a-}) = d(a, e)$, $w(T_{a+}) = d(e, a)$ and $w(T_{(a,b)}) = d(a, b)$. Waterman, Smith and Beyer [203] showed that the necessary and sufficient condition for the $\tau$-metric induced by the above weights to coincide with an s-metric is that $d$ be a metric on $\Sigma_e$.

In fact, the construction of Sellers has long been known in the theory of topological groups [153]. The s-metric on $\Sigma^+$ is equivalent to the Graev pseudo-metric [75, 76] on the free group $F(\Sigma)$ (i.e. the free group generated by $\Sigma$), restricted to $\Sigma^+$. The Graev pseudo-metric can be described as the maximal bi-invariant pseudo-metric $\bar{\rho}$ on $F(\Sigma)$ such that $\bar{\rho}|\Sigma_e = \rho$.

Example 3.3.10. Let $\Sigma$ be a set and for $u, v \in \Sigma^*$ denote by $LCS(u, v)$ the longest common subsequence of $u$ and $v$. Define

$$\rho_{LCS}(u, v) = |u| + |v| - 2\,|LCS(u, v)|.$$

It can be easily shown that $\rho_{LCS}$ is a metric on $\Sigma^*$ and that $\rho_{LCS} = \rho_{\tau_0,w}$ where $w(T_{a+}) = w(T_{a-}) = 1$ and $w(T_{(a,b)}) \ge 2$ for all $a, b \in \Sigma$ with $a \ne b$ (i.e. optimal sequences of edit operations only involve indels). The LCS metric provides a special case of string edit distance (more specifically of Sellers distance) which has been extensively studied in computer science.
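The formula of Example 3.3.10 is easy to evaluate with the standard dynamic programming computation of the length of a longest common subsequence. The Python sketch below is an illustration under that assumption; the function names are hypothetical.

```python
def lcs_length(u, v):
    """Length of a longest common subsequence of u and v (row-by-row DP)."""
    prev = [0] * (len(v) + 1)
    for a in u:
        cur = [0]
        for j, b in enumerate(v, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(u, v):
    """rho_LCS(u, v) = |u| + |v| - 2 |LCS(u, v)|."""
    return len(u) + len(v) - 2 * lcs_length(u, v)
```

For instance, the words COMPLEXITY and FLEXIBILITY share the subsequence LEXITY of length 6, so their LCS distance is 10 + 11 - 12 = 9.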

Example 3.3.11. Let $\Sigma$ be a set and suppose $\tau$ consists only of the transformations of the type $T_{(a,b)}$, where $a, b \in \Sigma$. Suppose $w(T_{(a,b)}) = d_\Sigma(a, b)$ where $d_\Sigma$ is a function $\Sigma \times \Sigma \to \mathbb{R}_+$ such that $d_\Sigma(a, a) = 0$ for all $a \in \Sigma$ and $d_\Sigma(a, b) > 0$ for all $a \ne b$. It is clear that $\rho_{\tau,w}(u, v) = \infty$ if and only if $|u| \ne |v|$ and therefore the partitions of the equivalence relation $\rho_{\tau,w}(u, v) < \infty$ are the sets $\Sigma^n$ for all $n \in \mathbb{N}_+$ plus the set $\{e\}$. It is easy to verify that on each $\Sigma^n$, $\rho_{\tau,w}$ coincides with the generalised Hamming distance $d$ if and only if $d_\Sigma$ satisfies the triangle inequality (i.e. $d_\Sigma$ is a quasi-metric).

3.3.2 Alignments 

In biology, one is usually interested not only in the distance between two words, 
but also in the edit script realising it. A standard way of representing an edit script 
mapping one sequence into another is called a (pairwise) alignment. 

Definition 3.3.12. Let $\Sigma$ be a set, $u, v \in \Sigma^+$ and suppose $(\tau_\lambda, w)$ is a set of weighted edit operations on $\Sigma^*$. A global alignment between $u$ and $v$ is a finite sequence of pairs $(u_i^*, v_i^*)$ such that $u_i^*, v_i^* \in \Sigma^*$ for all $i$ and

(i) $u = u_1^* u_2^* \cdots u_n^*$,

(ii) $v = v_1^* v_2^* \cdots v_n^*$,

(iii) $u_i^* \ne e \lor v_i^* \ne e$ for all $i$, and

(iv) there exists $T \in \tau_\lambda$ such that $v_i^* = T(u_i^*)$.

The weight or score of the alignment $((u_i^*, v_i^*))_i$ is the sum $\sum_i w(T_i)$ where $T_i \in \tau_\lambda$ and $v_i^* = T_i(u_i^*)$. ▲

The axiom (iii) in Definition 3.3.12 above ensures that a sequence that is a global alignment is finite.

Definition 3.3.13. A local alignment between $u, v \in \Sigma^*$ is a global alignment between $u'$ and $v'$ where $u'$ is a factor of $u$ and $v'$ a factor of $v$. ▲

Alignments are usually displayed by first inserting chosen spaces (or dashes), either into or at the ends of $u$ and $v$, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string [83].

It is obvious that every (global) alignment can be associated with an edit script of the same weight. The converse is not true in general, as Example 3.3.14 attests. Recall that $\tau_\lambda$ consists of substitutions, insertions and deletions (Definition 3.3.7) and that a superscript on a transformation $T$ denotes the start of the fragment being acted on by $T$ (Definition 3.3.2).

Example 3.3.14. Let $\Sigma = \{a, b, c\}$ and consider $(\tau_\lambda, w)$, the set of weighted edit operations on $\Sigma^*$ where $w(T_{(a,b)}) = w(T_{(b,c)}) = 1$, $w(T_{(a,c)}) = 3$ and for each $u \in \Sigma^+$, $w(T_{u+}) = w(T_{u-}) = 5$.

Suppose $u = aa$ and $v = ac$. Then, it is clear that $\zeta = T^2_{(b,c)}, T^2_{(a,b)} \in \{u \to v\}_{\tau_\lambda}$ and that $w(\zeta) = 2$. However, the alignment of smallest weight, $A = (a, a), (a, c)$, has weight 3. It is easy to see that all other possible alignments have an even greater weight.
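The gap between edit scripts and alignments in Example 3.3.14 can be checked by treating words as vertices of a weighted directed graph, in the spirit of Remark 3.3.6, and searching for a cheapest path. The Python sketch below is illustrative only: it uses Dijkstra's algorithm, and it assumes that only the three substitutions whose weights are stated in the example are available. Indels (weight 5 each) are left out because any script using them already costs more than 2 for this pair of words. All names are hypothetical.

```python
import heapq

# Substitution weights stated in Example 3.3.14; other operations are omitted (assumption).
SUB_WEIGHT = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 3}


def substitution_neighbours(word):
    """Yield (new_word, weight) for every single substitution applicable to word."""
    for j, letter in enumerate(word):
        for (a, b), weight in SUB_WEIGHT.items():
            if a == letter:
                yield word[:j] + b + word[j + 1:], weight


def edit_script_distance(source, target):
    """Smallest total weight of a substitution-only edit script (Dijkstra)."""
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, word = heapq.heappop(queue)
        if word == target:
            return dist
        if dist > best.get(word, float("inf")):
            continue  # stale queue entry
        for nxt, weight in substitution_neighbours(word):
            cand = dist + weight
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return float("inf")


def alignment_weight(u, v):
    """Weight of the letter-by-letter alignment of two words of equal length."""
    return sum(0 if a == b else SUB_WEIGHT.get((a, b), float("inf"))
               for a, b in zip(u, v))
```

Under these assumptions the cheapest edit script from `aa` to `ac` passes through `ab` and has weight 2, while the letter-by-letter alignment has weight 3, because the weight of the substitution of `c` for `a` violates the triangle inequality.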
Definition 3.3.15. Let $u, v \in \Sigma^+$. An edit script $T_{i_m}^{j_m}, T_{i_{m-1}}^{j_{m-1}}, \ldots, T_{i_1}^{j_1} \in \{u \to v\}_{\tau_\lambda}$ admits an alignment if there exists a sequence $(u_k^*)_{k=1}^{m}$ where $u_k^* \in \Sigma^*$ such that $u = u_m^* u_{m-1}^* \cdots u_1^*$ and $v = T_{i_m}(u_m^*) T_{i_{m-1}}(u_{m-1}^*) \cdots T_{i_1}(u_1^*)$. ▲

The following Lemma provides a straightforward characterisation of the above definition.

Lemma 3.3.16. Let $x, y \in \Sigma^+$. An edit script $T_{i_m}^{j_m}, T_{i_{m-1}}^{j_{m-1}}, \ldots, T_{i_1}^{j_1} \in \{x \to y\}_{\tau_\lambda}$, where $j_m \le j_{m-1} \le \cdots \le j_1$, admits an alignment if $j_m = 1$ and

(i) $j_1 = |x|$ if $T_{i_1} = T_{(a,b)}$ for some $a, b \in \Sigma$,

(ii) $j_1 = |x| + 1$ if $T_{i_1} = T_{u+}$ for some $u \in \Sigma^+$,

(iii) $j_1 = |x| - |u| + 1$ if $T_{i_1} = T_{u-}$ for some $u \in \Sigma^+$,

and for all $1 < k \le m$,

(iv) $j_k = j_{k-1} - 1$ if $T_{i_k} = T_{(a,b)}$ for some $a, b \in \Sigma$;

(v) $j_k = j_{k-1}$ if $T_{i_k} = T_{u+}$ for some $u \in \Sigma^+$;

(vi) $j_k = j_{k-1} - |u|$ if $T_{i_k} = T_{u-}$ for some $u \in \Sigma^+$.

Proof. For each $k = 1, 2, \ldots, m$ set

$$x_k^* = \begin{cases} a, & \text{if } T_{i_k} = T_{(a,b)} \text{ for some } a, b \in \Sigma, \\ e, & \text{if } T_{i_k} = T_{u+} \text{ for some } u \in \Sigma^+, \\ u, & \text{if } T_{i_k} = T_{u-} \text{ for some } u \in \Sigma^+. \end{cases}$$

We claim that $x = x_m^* x_{m-1}^* \cdots x_1^*$ and $y = T_{i_m}(x_m^*) T_{i_{m-1}}(x_{m-1}^*) \cdots T_{i_1}(x_1^*)$. The first claim is proven by showing by induction that for all $k = 1, 2, \ldots, m$,

$$x_{j_k} x_{j_k + 1} \cdots x_{|x|} = x_k^* x_{k-1}^* \cdots x_1^*.$$

Indeed, the conditions (i), (ii) and (iii) directly imply the base step while the conditions (iv), (v) and (vi) imply the inductive step. Since $j_m = 1$, it follows that $x = x_m^* x_{m-1}^* \cdots x_1^*$.

Similarly, the second claim is proven by showing by induction that for all $k = 1, 2, \ldots, m$,

$$T_{i_k}^{j_k} T_{i_{k-1}}^{j_{k-1}} \cdots T_{i_1}^{j_1}(x) = x_1 x_2 \cdots x_{j_k - 1}\, T_{i_k}(x_k^*) T_{i_{k-1}}(x_{k-1}^*) \cdots T_{i_1}(x_1^*).$$

The base step in this case follows from the definition of $T_{i_1}^{j_1}$ while the inductive step follows easily from the conditions (iv), (v) and (vi). □

The following simple result was first observed by Smith, Waterman and Fitch [178].

Lemma 3.3.17 ([178]). Let $\Sigma$ be a set, $u, v \in \Sigma^*$ and suppose $((u_i^*, v_i^*))_i$ is a global alignment between $u$ and $v$. Then

$$|u| + |v| = 2 \sum_{a \in \Sigma} \sum_{b \in \Sigma} M_{a,b} + \sum_k k I_k + \sum_k k D_k \qquad (3.1)$$

where $M_{a,b} = |\{i : u_i^* = a \land v_i^* = b\}|$ for $a, b \in \Sigma$, $I_k = |\{i : u_i^* = e \land |v_i^*| = k\}|$ and $D_k = |\{i : v_i^* = e \land |u_i^*| = k\}|$. □

String edits and alignments are best illustrated by examples. For simplicity we 
use the Levenstein distance. 

Example 3.3.18. Let $\Sigma$ be the English alphabet, let $u$ = COMPLEXITY and $v$ = FLEXIBILITY. It is easy to see that the Levenstein distance between $u$ and $v$ is 8. Indeed, if we align $u$ and $v$ in the following way,

COMPLEXI----TY
---FLEXIBILITY

we note that seven indels and one substitution are necessary to convert $u$ into $v$ and vice versa. One can also easily see that this is the smallest number of transformations necessary (more formally, this fact would be a simple corollary of Theorem 3.3.27, to be stated and proven later).

The string edit distances may, in some cases, be more suitable for comparison 
of strings of the same length than the (generalised) Hamming distance. 

Example 3.3.19. Consider the words u = ABCDEF and v = FABCDE of length 
6. The Hamming distance between u and v is 6 while the Levenstein distance is 
2. 
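The values quoted in Examples 3.3.18 and 3.3.19 can be reproduced with the textbook dynamic programming computation of the unit-cost edit distance. The Python sketch below is an illustration with hypothetical function names; it is the standard row-by-row algorithm and not a construction specific to this chapter.

```python
def levenshtein(u, v):
    """Unit-cost edit distance (substitutions, single-character indels)."""
    prev = list(range(len(v) + 1))          # distances from the empty prefix of u
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j - 1] + (a != b),   # match or substitution
                           prev[j] + 1,              # deletion of a
                           cur[j - 1] + 1))          # insertion of b
        prev = cur
    return prev[-1]


def hamming(u, v):
    """Number of positions at which two words of equal length differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs words of equal length")
    return sum(a != b for a, b in zip(u, v))
```

A single shift of a word, as in ABCDEF and FABCDE, changes every position and so maximises the Hamming distance, while the edit distance only pays for one insertion and one deletion.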
3.3.3 Dynamic programming algorithms

While the $\tau$-metric (and quasi-metric) can be generated from any set of transformations of $\Sigma^*$, the main motivation of Waterman, Smith and Beyer in [203] was to extend the construction of Sellers [171] so that indels of multiple characters with weights less than the sum of the weights of indels of individual characters can be permitted. The algorithm they proposed for computing such distances is based on the dynamic programming technique, introduced by Bellman [13] in the general context and first applied to biological sequence comparison by Needleman and Wunsch [146] using similarities and by Sellers [171] using distances. Dynamic programming remains the foundation of all pairwise biological sequence alignment algorithms and we here briefly present it in relation to the W-S-B algorithm.

The three essential components of the dynamic programming approach are 
recurrence relation, tabular computation and the traceback. 

Recurrence Relations 

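We now outline the recurrence relations used for computation of the W-S-B metric which takes into account indels of multiple characters.

Definition 3.3.20. Let $\Sigma$ be a set. The set of weighted edit operations $(\tau_\lambda, w)$ on $\Sigma^*$ satisfies the condition M if for all $x, y \in \Sigma^+$ and for each sequence of edit operations $\zeta \in \{x \to y\}_{\tau_\lambda}$ there exists $\eta \in \{x \to y\}_{\tau_\lambda}$ which admits an alignment and $w(\eta) \le w(\zeta)$. ▲

The condition M was introduced in [203] in a slightly different but essentially equivalent form. It implies that the W-S-B distance between any two points is determined solely from edit scripts admitting an alignment and leads to the following theorem. Recall that for all $u \in \Sigma^*$ and for any $1 \le k \le |u|$, $\bar{u}_k$ denotes the word $u_1 u_2 \cdots u_k$ and that $\bar{u}_0 = e$.

Theorem 3.3.21 ([203]). Let $\Sigma$ be a set, $x, y \in \Sigma^*$ and suppose $(\tau_\lambda, w)$ is a set of weighted edit operations on $\Sigma^*$ satisfying the condition M. Then, for all $0 \le i \le |x|$, $0 \le j \le |y|$ such that $i + j \ne 0$,

$$\rho_{\tau_\lambda,w}(\bar{x}_i, \bar{y}_j) = \min \Big\{ \rho_{\tau_\lambda,w}(\bar{x}_{i-1}, \bar{y}_{j-1}) + w(T_{(x_i, y_j)}),\ \min_{1 \le k \le i} \big\{ \rho_{\tau_\lambda,w}(\bar{x}_{i-k}, \bar{y}_j) + w(T_{x_{i-k+1} x_{i-k+2} \cdots x_i -}) \big\},\ \min_{1 \le k \le j} \big\{ \rho_{\tau_\lambda,w}(\bar{x}_i, \bar{y}_{j-k}) + w(T_{y_{j-k+1} y_{j-k+2} \cdots y_j +}) \big\} \Big\},$$

where $\rho_{\tau_\lambda,w}(\bar{x}_p, \bar{y}_q)$ is ignored if $p$ or $q$ are negative.

Proof. Obviously $\rho_{\tau_\lambda,w}(\bar{x}_0, \bar{y}_0) = 0$. Fix $0 \le i \le |x|$ and $0 \le j \le |y|$ such that $i + j \ne 0$. Since $(\tau_\lambda, w)$ satisfies the condition M, there exists an edit script $T_{i_m}^{j_m}, T_{i_{m-1}}^{j_{m-1}}, \ldots, T_{i_1}^{j_1} \in \{\bar{x}_i \to \bar{y}_j\}_{\tau_\lambda}$ that admits an alignment and $\rho_{\tau_\lambda,w}(\bar{x}_i, \bar{y}_j) = \sum_{k=1}^{m} w(T_{i_k})$. Since $T_{i_m}^{j_m}, T_{i_{m-1}}^{j_{m-1}}, \ldots, T_{i_1}^{j_1}$ admits an alignment, it follows that $T_{i_m}^{j_m}, T_{i_{m-1}}^{j_{m-1}}, \ldots, T_{i_2}^{j_2} \in \{\bar{x}_{i'} \to \bar{y}_{j'}\}_{\tau_\lambda}$ for some $i' \le i$, $j' \le j$ and that $\rho_{\tau_\lambda,w}(\bar{x}_{i'}, \bar{y}_{j'}) = \sum_{k=2}^{m} w(T_{i_k})$ (otherwise the assumption $\rho_{\tau_\lambda,w}(\bar{x}_i, \bar{y}_j) = \sum_{k=1}^{m} w(T_{i_k})$ would be violated). The proof is completed by considering all possibilities for $T_{i_1}$. □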
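A recurrence of the kind given in Theorem 3.3.21 translates directly into a table-filling procedure. The Python sketch below is an illustration under the assumption that the weights satisfy condition M, so that only substitutions and indels of suffix segments of the two prefixes need to be considered; the function name and the weight callbacks `sub_w`, `ins_w` and `del_w` are hypothetical. It only computes the distance and omits the traceback.

```python
def wsb_distance(x, y, sub_w, ins_w, del_w):
    """Fill rho[i][j], the distance between the prefixes x[:i] and y[:j].

    sub_w(a, b): weight of substituting letter b for letter a
    ins_w(seg):  weight of inserting the segment seg
    del_w(seg):  weight of deleting the segment seg
    """
    n, m = len(x), len(y)
    inf = float("inf")
    rho = [[inf] * (m + 1) for _ in range(n + 1)]
    rho[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i >= 1 and j >= 1:                      # substitution of y_j for x_i
                best = min(best, rho[i - 1][j - 1] + sub_w(x[i - 1], y[j - 1]))
            for k in range(1, i + 1):                  # deletion of the last k letters of x[:i]
                best = min(best, rho[i - k][j] + del_w(x[i - k:i]))
            for k in range(1, j + 1):                  # insertion of the last k letters of y[:j]
                best = min(best, rho[i][j - k] + ins_w(y[j - k:j]))
            rho[i][j] = best
    return rho[n][m]


def unit_sub(a, b):
    """Hypothetical substitution weight: 0 for a match, 1 otherwise."""
    return 0 if a == b else 1


def linear_gap(segment):
    """Hypothetical indel weight proportional to length (recovers unit indels)."""
    return len(segment)


def concave_gap(segment):
    """Hypothetical indel weight: opening cost 1 plus 0.5 per letter."""
    return 1 + 0.5 * len(segment)
```

With `linear_gap` the procedure reduces to the unit-cost edit distance, while with `concave_gap` inserting a segment of two letters at once is cheaper than two separate single-letter insertions. The running time is cubic in the word lengths, since every cell inspects all possible segment lengths.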

Remark 3.3.22. Under the conditions of Theorem 3.3.21 it is clear that $\rho_{\tau_\lambda,w}$ is invariant (in the sense of Definition 2.6.21) with respect to string concatenation, that is, for all $x, y, z \in \Sigma^*$,

$$\rho_{\tau_\lambda,w}(xz, yz) \le \rho_{\tau_\lambda,w}(x, y) \quad \text{and} \quad \rho_{\tau_\lambda,w}(zx, zy) \le \rho_{\tau_\lambda,w}(x, y).$$

Hence, the triple $(\Sigma^*, \rho_{\tau_\lambda,w}, \star)$, where $\star$ is the string concatenation operation, is a quasi-metric semigroup (Definition 2.6.21).

Definition 3.3.23. Let $\Sigma$ be a set. A map $f : \Sigma^+ \to \mathbb{R}$ is called increasing if for any $u \in \Sigma^+$ and any $v \in \mathcal{F}(u) \setminus \{e\}$, $f(v) \le f(u)$. ▲

Definition 3.3.24. Let $\Sigma$ be a set. The set of weighted edit operations $(\tau_\lambda, w)$ on $\Sigma^*$ satisfies the condition N if

(i) $w(T_{(a,b)}) = d(a, b)$ for all $a, b \in \Sigma$,

(ii) $w(T_{u+}) = g(|u|) + \sum_{k=1}^{|u|} s(u_k)$ for all $u \in \Sigma^+$, and

(iii) $w(T_{u-}) = h(|u|) + \sum_{k=1}^{|u|} t(u_k)$ for all $u \in \Sigma^+$,

where $d$ is a quasi-metric on $\Sigma$, $g, h$ are non-decreasing positive functions $\mathbb{N} \to \mathbb{R}_+$, and $s, t$ are non-negative functions $\Sigma \to \mathbb{R}_+$ such that for all $a, b \in \Sigma$, $s(b) - s(a) \le d(a, b)$ ($s$ is right 1-Lipschitz) and $t(a) - t(b) \le d(a, b)$ ($t$ is left 1-Lipschitz). ▲

We now show that the condition N implies the condition M. 

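Lemma 3.3.25. Let $\Sigma$ be a set and $(\tau_\Sigma, w)$ a set of weighted edit operations on $\Sigma^*$ satisfying the condition N. Suppose $x = x_1 x_2 \cdots x_m \in \Sigma^*$, $1 \le j_2 \le j_1 \le m + 1$ and let $T_1, T_2 \in \tau$ be such that $T_1^{j_1} T_2^{j_2}(x)$ is well-defined. Denote $x' = T_1^{j_1} T_2^{j_2}(x)$ and $\zeta = T_1^{j_1}, T_2^{j_2} \in \{x \to x'\}_{\tau_\Sigma}$. Then, there exists an edit script $\eta = T_3^{j_2}, T_4^{l} \in \{x \to x'\}_{\tau_\Sigma}$ such that $j_2 \le l$ and $w(\eta) \le w(\zeta)$.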

Proof. There are nine principal cases corresponding to all combinations of transformation types in $\zeta$.

If $T_2 = T_{(a,b)}$ for some $a, b \in \Sigma$ (the transformation acting on the position $j_2$ is a substitution), it is easy to see that $T_1^{j_1} T_2^{j_2} = T_2^{j_2} T_1^{j_1}$, whatever $T_1$ might be. Similarly, if $T_2 = T_{v-}$ for some $v \in \Sigma^+$ (the transformation acting on the position $j_2$ is a deletion), we have $T_1^{j_1} T_2^{j_2} = T_2^{j_2} T_1^{l}$, where $l = j_1 + |v|$, again whatever $T_1$ might be. This covers six cases.

Now consider the three cases where $T_2 = T_{u+}$ (the transformation acting on the position $j_2$ is an insertion). If $j_1 > |u| + j_2$, then, whatever $T_1$ might be, $T_1^{j_1} T_2^{j_2} = T_2^{j_2} T_1^{l}$, where $l = j_1 - |u|$, and the statement is satisfied. Hence, assume without loss of generality that $j_1 \le |u| + j_2$.

If $T_1 = T_{v+}$ for some $v \in \Sigma^+$, we have a situation where $u = yz$ and

\[ x_1^* x_2^* \mapsto x_1^* y z x_2^* \mapsto x_1^* y v z x_2^*, \tag{3.2} \]

for some $x_1^*, x_2^* \in \Sigma^*$ and $y, z \in \Sigma^*$ and where $w(\zeta) = g(|yz|) + g(|v|) + \sum_{k=1}^{|y|} s(y_k) + \sum_{k=1}^{|z|} s(z_k) + \sum_{k=1}^{|v|} s(v_k)$. Since the weight of $\zeta$ depends solely on composition and length of inserted fragments and not on the order of generators within them, we can set $\eta = T_{u'+}^{j_2}, T_{v'+}^{j_2}$, where $u'v' = yvz$ and $|u'| = |yz|$. Clearly, $|v'| = |yvz| - |yz| = |v|$ and hence $w(\eta) = w(\zeta)$.

If $T_1 = T_{(a,b)}$ for some $a, b \in \Sigma$, we have a situation where $u = yaz$ and

\[ x_1^* x_2^* \mapsto x_1^* y a z x_2^* \mapsto x_1^* y b z x_2^*, \tag{3.3} \]

for some $x_1^*, x_2^*, y, z \in \Sigma^*$ and $w(\zeta) = g(|yaz|) + \sum_{k=1}^{|y|} s(y_k) + \sum_{k=1}^{|z|} s(z_k) + s(a) + d(a, b)$. In this case, we can set $\eta = T_{(ybz)+}^{j_2}, P^{l}$, where $w(\eta) = g(|ybz|) + \sum_{k=1}^{|y|} s(y_k) + \sum_{k=1}^{|z|} s(z_k) + s(b)$. As $s$ is right 1-Lipschitz ($s(b) - s(a) \le d(a, b)$), it follows that $w(\eta) \le w(\zeta)$. The identity transformation $P^{l} = T_{(x,x)}^{l}$ is there so that the form of $\eta$ exactly satisfies the statement of the Lemma.

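If $T_1 = T_{v-}$ for some $v \in \Sigma^+$, we have a situation where $u = yvz$ and

\[ x_1^* x_2^* \mapsto x_1^* y v z x_2^* \mapsto x_1^* y z x_2^*, \tag{3.4} \]

for some $x_1^*, x_2^*, y, z \in \Sigma^*$ such that $yz \in \Sigma^+$, and $w(\zeta) = g(|yvz|) + \sum_{k=1}^{|y|} s(y_k) + \sum_{k=1}^{|z|} s(z_k) + \sum_{k=1}^{|v|} s(v_k) + h(|v|) + \sum_{k=1}^{|v|} t(v_k)$. Set $\eta = T_{(yz)+}^{j_2}, P^{l}$ so that $w(\eta) = g(|yz|) + \sum_{k=1}^{|y|} s(y_k) + \sum_{k=1}^{|z|} s(z_k)$. Since $h$, $s$ and $t$ are non-negative functions and $g$ is a non-decreasing function, we have $w(\eta) \le w(\zeta)$. □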

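Lemma 3.3.26. Let $\Sigma$ be a set and $(\tau_\Sigma, w)$ a set of weighted edit operations on $\Sigma^*$ satisfying the condition N. Then, for any $x, y \in \Sigma^*$ and any edit script $\zeta \in \{x \to y\}_{\tau_\Sigma}$, there exists an edit script $\eta = T_m^{j_m}, T_{m-1}^{j_{m-1}}, \ldots, T_1^{j_1} \in \{x \to y\}_{\tau_\Sigma}$ such that

\[ j_m \le j_{m-1} \le \cdots \le j_1 \quad\text{and}\quad w(\eta) \le w(\zeta). \]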

Proof. Let $x, y \in \Sigma^*$ and let $\zeta = T_m^{j_m}, T_{m-1}^{j_{m-1}}, \ldots, T_1^{j_1} \in \{x \to y\}_{\tau_\Sigma}$. We construct the required edit script $\eta$ by using Lemma 3.3.25 recursively on pairs of transformations from $\zeta$.

Set $\eta^0 = \zeta$ and find the largest $k$ such that $j_k$ is the smallest superscript in $\eta^0$. If $k = m$, set $\eta^1 = \eta^0$ and proceed to the next step. Otherwise, produce a new edit script $\eta^0_1 \in \{x \to y\}_{\tau_\Sigma}$ such that $w(\eta^0_1) \le w(\zeta)$, by replacing the pair of terms $T_{k+1}^{j_{k+1}}, T_k^{j_k}$ in $\eta^0$ by a pair $T_{k+1}'^{\,j_k}, T_k'^{\,l}$ where $l \ge j_k$. By Lemma 3.3.25 this is always possible.

After this step, $j_k$ will remain the smallest superscript in $\eta^0_1$. Apply the same procedure to $\eta^0_1$ to produce $\eta^0_2$ and so on. After at most $m$ steps we get an edit script $\eta^1$, with the same number of terms as $\zeta$, such that the superscript of its leftmost term is the smallest superscript.

To get from $\eta^p$ to $\eta^{p+1}$, $1 \le p \le m - 1$, repeat the above procedure on the edit script formed by the $m - p$ rightmost terms of $\eta^p$, and then set $\eta^{p+1}$ to be the $p$ leftmost terms of $\eta^p$ followed by the edit script so obtained. After $m$ such steps we get $\eta = \eta^m = T_m'^{\,j_m'}, T_{m-1}'^{\,j_{m-1}'}, \ldots, T_1'^{\,j_1'}$, where $j_m' \le j_{m-1}' \le \cdots \le j_1'$.

Since the weight did not increase at any step, it follows that $w(\eta) \le w(\zeta)$. □

Theorem 3.3.27. Let $\Sigma$ be a set and $(\tau_\Sigma, w)$ a set of weighted edit operations on $\Sigma^*$ satisfying the condition N. Then, for any $x, y \in \Sigma^*$ and any edit script $\zeta \in \{x \to y\}_{\tau_\Sigma}$ there exists an edit script $\theta \in \{x \to y\}_{\tau_\Sigma}$ such that $\theta$ admits an alignment and $w(\theta) \le w(\zeta)$.

Proof. Let $x, y \in \Sigma^*$ and let $\zeta = T_m^{j_m}, T_{m-1}^{j_{m-1}}, \ldots, T_1^{j_1} \in \{x \to y\}_{\tau_\Sigma}$. If $\zeta$ already admits an alignment, there is nothing to prove. Otherwise, due to Lemma 3.3.26, we can assume without loss of generality that $j_m \le j_{m-1} \le \cdots \le j_1$. Using a recursive process starting from $\zeta$ we construct an edit script $\theta \in \{x \to y\}_{\tau_\Sigma}$ that satisfies the requirements of Lemma 3.3.16 and hence admits an alignment. We will use the notation $\theta_p = T_{m_p}^{j_{m_p}}, T_{m_p - 1}^{j_{m_p - 1}}, \ldots, T_1^{j_1}$, where $p = 0, 1, \ldots, N$, to denote the edit script at each step of the recursion.

If $j_m > 1$, set $\theta_0 = T_{(x_1,x_1)}^{1}, T_m^{j_m}, T_{m-1}^{j_{m-1}}, \ldots, T_1^{j_1}$; otherwise set $\theta_0 = \zeta$. For each $p$, let $k_p$ denote the largest index such that one of the conditions (iv), (v) or (vi) of Lemma 3.3.16 is not satisfied (which one of the three is violated depends on the type of $T_{k_p}$).

If $T_{k_p} = T_{(b,c)}$ for some $b, c \in \Sigma$, the condition (iv) of Lemma 3.3.16 requires that $j_{k_p} = j_{k_p - 1} - 1$. Since the condition (iv) is violated, it must follow that either $j_{k_p} < j_{k_p - 1} - 1$ or $j_{k_p} = j_{k_p - 1}$. In the former case, set

\[ \theta_{p+1} = T_{m_p}^{j_{m_p}}, \ldots, T_{k_p}^{j_{k_p}}, P^{l}, T_{k_p - 1}^{j_{k_p - 1}}, \ldots, T_1^{j_1}, \quad\text{where } l = j_{k_p} + 1. \]

Since the inserted transformation $P^{l}$ is the identity transformation, the weight does not change.

In the latter case there are three possibilities. If $T_{k_p - 1} = T_{(a,b)}$ for some $a, b \in \Sigma$, construct $\theta_{p+1}$ by replacing the terms $T_{(b,c)}^{j_{k_p}}, T_{(a,b)}^{j_{k_p - 1}}$ in $\theta_p$, of total weight $d(b, c) + d(a, b)$, with a single transformation $T_{(a,c)}^{j_{k_p}}$ of weight $d(a, c)$, and leaving the rest of $\theta_p$ unchanged. Clearly, since $d$ satisfies the triangle inequality, $w(\theta_{p+1}) \le w(\theta_p)$. If $T_{k_p - 1} = T_{u+}$ for some $u = bv \in \Sigma^+$, construct $\theta_{p+1}$ by replacing the terms $T_{(b,c)}^{j_{k_p}}, T_{(bv)+}^{j_{k_p - 1}}$ in $\theta_p$, of total weight $d(b, c) + g(|bv|) + s(b) + \sum_i s(v_i)$, with a single transformation $T_{(cv)+}^{j_{k_p}}$ of weight $g(|cv|) + s(c) + \sum_i s(v_i)$. Again, $w(\theta_{p+1}) \le w(\theta_p)$ because of the right Lipschitz assumption on $s$. If $T_{k_p - 1} = T_{u-}$ for some $u \in \Sigma^+$, construct $\theta_{p+1}$ by replacing the terms $T_{(b,c)}^{j_{k_p}}, T_{u-}^{j_{k_p - 1}}$ in $\theta_p$ with $T_{u-}^{j_{k_p}}, T_{(b,c)}^{l}$, where $l = j_{k_p} + |u|$, without changing the weight.

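If $T_{k_p} = T_{u+}$ for some $u \in \Sigma^+$, the condition (v) of Lemma 3.3.16 requires that $j_{k_p} = j_{k_p - 1}$. Since we assume it is violated, it follows that $j_{k_p} < j_{k_p - 1}$. In this case $\theta_{p+1}$ is obtained from $\theta_p$ by inserting an identity transformation between $T_{k_p}$ and $T_{k_p - 1}$. Since the inserted transformation is the identity transformation, the weight does not change.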

Finally, if $T_{k_p} = T_{u-}$ for some $u \in \Sigma^+$, the condition (vi) of Lemma 3.3.16 requires that $j_{k_p} = j_{k_p - 1} - |u|$. If $j_{k_p} < j_{k_p - 1} - |u|$, set, without changing the weight,

\[ \theta_{p+1} = T_{m_p}^{j_{m_p}}, \ldots, T_{k_p}^{j_{k_p}}, P^{l}, T_{k_p - 1}^{j_{k_p - 1}}, \ldots, T_1^{j_1}, \quad\text{where } l = j_{k_p} + |u|. \]

If $j_{k_p - 1} - |u| < j_{k_p} \le j_{k_p - 1}$ and $T_{k_p - 1} = T_{v-}$ for some $v \in \Sigma^+$, we have a situation where $u = yz$ and

\[ x_1^* y v z x_2^* \mapsto x_1^* y z x_2^* \mapsto x_1^* x_2^*, \tag{3.5} \]

for some $x_1^*, x_2^* \in \Sigma^*$ and $y, z \in \Sigma^*$. Construct $\theta_{p+1}$ by replacing the terms $T_{(yz)-}, T_{v-}$ in $\theta_p$ with $T_{u'-}, T_{v'-}$ such that $u'v' = yvz$ and $|u'| = |yz|$. Clearly, this case is analogous to (3.2) of Lemma 3.3.25 and, since the weight of a deletion also depends only on composition and length of deleted fragments, $\theta_{p+1}$ will have the same weight as $\theta_p$.

If $j_{k_p - 1} - |u| < j_{k_p} \le j_{k_p - 1}$ and $T_{k_p - 1} = T_{(a,b)}$ for some $a, b \in \Sigma$, we have a situation where $u = ybz$ and

\[ x_1^* y a z x_2^* \mapsto x_1^* y b z x_2^* \mapsto x_1^* x_2^*, \tag{3.6} \]

for some $x_1^*, x_2^*, y, z \in \Sigma^*$. Construct $\theta_{p+1}$ by replacing the terms $T_{(ybz)-}, T_{(a,b)}$ in $\theta_p$ by a single transformation $T_{(yaz)-}$. This case is analogous to (3.3) of Lemma 3.3.25 and hence, by the left 1-Lipschitz assumption on $t$, $w(\theta_{p+1}) \le w(\theta_p)$.

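If $j_{k_p - 1} - |u| < j_{k_p} \le j_{k_p - 1}$ and $T_{k_p - 1} = T_{v+}$ for some $v \in \Sigma^+$, we have a situation where $u = yvz$ and

\[ x_1^* y z x_2^* \mapsto x_1^* y v z x_2^* \mapsto x_1^* x_2^*, \tag{3.7} \]

for some $x_1^*, x_2^*, y, z \in \Sigma^*$. Construct $\theta_{p+1}$ by replacing the terms $T_{(yvz)-}, T_{v+}$ in $\theta_p$ by a single transformation $T_{(yz)-}$. This case is analogous to (3.4) of Lemma 3.3.25 and, by a similar argument, $\theta_{p+1}$ will have the same weight as $\theta_p$.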

Hence, in all cases where one of the conditions (iv), (v) or (vi) of Lemma 3.3.16 is violated, we construct a new edit script of no greater weight where all transformations up to and including the previously violating transformation now fully satisfy the conditions. Depending on the particular type of violation, the number of transformations in the new edit script either decreases by one, remains the same or increases by one. The only way it can increase is by inserting an identity transformation and clearly, there can be only finitely many such insertions. Thus, the recursion terminates after finitely many steps. It remains to satisfy the conditions (i), (ii) and (iii) of Lemma 3.3.16 concerning the first edit operation. This can be achieved by inserting as many of the identity transformations as necessary. □

Remark 3.3.28. Theorem 3.3.27 is also valid in the case where $g = 0$ and $h = 0$, but in that case, in order to satisfy Definition 3.3.1 of $(\tau, w)$, $s$ and $t$ must be strictly positive.

Theorem 3.3.27 is a generalisation of Theorem 4 of [203], which assumes $w(T_{(a,b)}) = \lambda$, $w(T_{u+}) = g(|u|)$ and $w(T_{u-}) = h(|u|)$, where $\lambda > 0$ and $g, h$ are positive increasing functions. The functions $g$ and $h$ giving the weights of indels are called gap penalties. The most widely used gap penalties are linear, of the form $g(k) = ak$, and affine, of the form $g(k) = a + bk$, where $k$ is the length of a gap and $a, b$ are constants. Both linear and affine gap penalties are examples of concave functions, satisfying $g(k + l) \le g(k) + g(l)$. Gap penalties of the form $g(k) = a + b \log(k)$ have also been proposed [74].
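The inequality $g(k + l) \le g(k) + g(l)$ quoted above can be checked numerically for particular gap penalties. The following is a minimal sketch only: the helper `is_subadditive` and all constants are hypothetical choices made for illustration and do not come from the text. It also suggests that, for the logarithmic form, whether the inequality holds depends on the constants chosen.

```python
import math

def is_subadditive(g, max_len=50):
    """Check g(k + l) <= g(k) + g(l) for all gap lengths 1 <= k, l <= max_len."""
    return all(g(k + l) <= g(k) + g(l) + 1e-12   # small tolerance for floats
               for k in range(1, max_len + 1)
               for l in range(1, max_len + 1))

# Hypothetical constants, chosen only for illustration.
linear = lambda k: 1.0 * k                       # g(k) = a*k
affine = lambda k: 9.0 + 1.0 * k                 # g(k) = a + b*k
log_large_a = lambda k: 9.0 + 1.0 * math.log(k)  # g(k) = a + b*log(k), large a
log_small_a = lambda k: 0.1 + 4.0 * math.log(k)  # same form, small a

print(is_subadditive(linear), is_subadditive(affine))
print(is_subadditive(log_large_a), is_subadditive(log_small_a))
```

With the second logarithmic choice the check fails already at $k = l = 1$, since $g(2) = 0.1 + 4 \log 2$ exceeds $g(1) + g(1) = 0.2$.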



The complexity of dynamic programming algorithms depends on the gap penalty. In general, Waterman, Smith and Beyer [203] obtained the $O(m^2 n + m n^2)$ average and worst case running time, where $m = |x|$ and $n = |y|$. If $g$ and $h$ are linear, this
can be reduced to 0{nm). The same bounds hold for affine gap penalties using 
the algorithm of Gotoh flTH. 

Tabular computation 

Theorem 3.3.21 can be used directly to compute $\rho_{\tau_\Sigma,w}(x, y)$ for any $x, y \in \Sigma^*$. Let $m = |x|$ and $n = |y|$ and let $D$ be an $(m + 1) \times (n + 1)$ matrix with rows and columns indexed from 0. Suppose $w(T_{(a,b)}) = d(a, b)$, $w(T_{u+}) = g(|u|)$ and $w(T_{u-}) = h(|u|)$ where $d$ is a quasi-metric and $g, h$ are positive increasing functions. Clearly, $(\tau_\Sigma, w)$ satisfies the condition N and hence, by Theorem 3.3.27, the condition M.

Set $D_{0,0} = 0$, $D_{i,0} = \min_{1 \le k \le i} \{D_{i-k,0} + h(k)\}$, $D_{0,j} = \min_{1 \le k \le j} \{D_{0,j-k} + g(k)\}$ and for all $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$,

\[
D_{i,j} = \min \Big\{ D_{i-1,j-1} + d(x_i, y_j),\; \min_{1 \le k \le j} \{D_{i,j-k} + g(k)\},\; \min_{1 \le k \le i} \{D_{i-k,j} + h(k)\} \Big\}.
\]

The form of the recurrence above is the same as in Theorem 3.3.21 and hence $\rho_{\tau_\Sigma,w}(x, y) = D_{m,n}$. The tabular computation approach involves computation of $D_{m,n}$ bottom-up: the values of $D_{i,j}$ for all $1 \le i \le m$ and $1 \le j \le n$ are computed in an increasing row (or column) order. Example 3.3.29 provides an illustration.
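The tabular computation can be sketched in a few lines of Python. This is an illustrative sketch rather than the author's implementation: the function name `wsb_table` and the way $d$, $g$ and $h$ are passed as callables are assumptions made here. The parameters in the usage lines are $d(a,b) = 4$ for $a \ne b$ and $g(k) = h(k) = 9 + k$.

```python
def wsb_table(x, y, d, g, h):
    """Fill the (m+1) x (n+1) table D of the recurrence; D[m][n] is the distance."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # first column: deletions only
        D[i][0] = min(D[i - k][0] + h(k) for k in range(1, i + 1))
    for j in range(1, n + 1):            # first row: insertions only
        D[0][j] = min(D[0][j - k] + g(k) for k in range(1, j + 1))
    for i in range(1, m + 1):            # remaining cells in increasing row order
        for j in range(1, n + 1):
            substitution = D[i - 1][j - 1] + d(x[i - 1], y[j - 1])
            insertion = min(D[i][j - k] + g(k) for k in range(1, j + 1))
            deletion = min(D[i - k][j] + h(k) for k in range(1, i + 1))
            D[i][j] = min(substitution, insertion, deletion)
    return D

def d(a, b):
    return 0 if a == b else 4            # substitution cost

def gap(k):
    return 9 + k                         # gap penalty used for both g and h

D = wsb_table("COMPLEXITY", "FLEXIBILITY", d, gap, gap)
print(D[10][11])  # 29
```

Because the inner minima range over all gap lengths $k$, this sketch runs in $O(m^2 n + m n^2)$ time, matching the general bound quoted above; it makes no attempt at the faster algorithms available for linear or affine penalties.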

Example 3.3.29. Let $\Sigma$ be the English alphabet, let $u$ = COMPLEXITY and $v$ = FLEXIBILITY as in Example 3.3.18. For all $a, b \in \Sigma$, set $d(a, b) = 0$ if $a = b$ and $d(a, b) = 4$ if $a \ne b$ and let $g(k) = h(k) = 9 + k$. The matrix (or table) $D$ used for computation of the W-S-B distance $\rho_{\tau_\Sigma,w}$ is given in Table 3.1; observe that $\rho_{\tau_\Sigma,w}(u, v) = D_{10,11} = 29$.







CHAPTER 3. SEQUENCES AND SIMILARITIES 










             0    1    2    3    4    5    6    7    8    9   10   11
                  F    L    E    X    I    B    I    L    I    T    Y
 0          0*   10   11   12   13   14   15   16   17   18   19   20
 1  C      10     4   14   15   16   17   18   19   20   21   22   23
 2  O      11    14    8   18   19   20   21   22   23   24   25   26
 3  M      12*   15   18   12   22   23   24   25   26   27   28   29
 4  P      13    16*  19   22   16   26   27   28   29   30   31   32
 5  L      14    17   16*  23   26   20   29   30   28   32   33   34
 6  E      15    18   21   16*  26   27   24   29   30   31   32   33
 7  X      16    19   22   25   16*  26   27   28   29*  30   31   32
 8  I      17    20   23   26   26   16   26   27   28   29*  30   31
 9  T      18    21   24   27   27   26   20   30   31   32   29*  34
10  Y      19    22   25   28   28   27   30   24   34   35   36   29*

Table 3.1: The dynamic programming table used to compute the W-S-B distance between the strings COMPLEXITY and FLEXIBILITY. The cells on an optimal path between (0, 0) and (m, n) are shown in bold (marked here with *).

Traceback 

Computation using a dynamic programming table provides the value of the distance but often, especially in biological applications, an optimal edit script (which need not be unique) and the corresponding alignment need to be retrieved. This is most easily achieved (at least conceptually) by keeping one or more pointers at each entry $(i, j)$ of the dynamic programming table $D$ apart from $(0, 0)$, pointing to the entries $(i_0, j_0)$ such that $D_{i,j}$ is obtained by summing $D_{i_0,j_0}$ and the weight of the corresponding transformation. An optimal edit script is obtained by following any path of pointers from $(m, n)$ to $(0, 0)$ and accumulating the transformations corresponding to each pointer. This procedure is known as traceback. It is clear that there exists a 1-1 correspondence between alignments and paths between $(0, 0)$ and $(m, n)$.
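The traceback procedure can be sketched as follows. This is a hedged illustration, not the author's code: the name `wsb_align`, the single-pointer-per-cell simplification and the tie-breaking rule (Python's ordering of `(weight, cell)` tuples) are all assumptions made here, so when several optimal paths exist the sketch returns just one of them.

```python
def wsb_align(x, y, d, g, h):
    """Return (distance, top_row, bottom_row) for one optimal alignment."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]   # one pointer per cell
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:                        # substitution
                candidates.append((D[i - 1][j - 1] + d(x[i - 1], y[j - 1]),
                                   (i - 1, j - 1)))
            for k in range(1, j + 1):                  # insertion of k letters
                candidates.append((D[i][j - k] + g(k), (i, j - k)))
            for k in range(1, i + 1):                  # deletion of k letters
                candidates.append((D[i - k][j] + h(k), (i - k, j)))
            D[i][j], back[i][j] = min(candidates)
    # Follow the pointers from (m, n) back to (0, 0), collecting alignment columns.
    top, bottom = [], []
    i, j = m, n
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        width = max(i - pi, j - pj)
        top.append(x[pi:i].ljust(width, "-"))          # gap symbols pad a deletion/insertion
        bottom.append(y[pj:j].ljust(width, "-"))
        i, j = pi, pj
    return D[m][n], "".join(reversed(top)), "".join(reversed(bottom))

def d(a, b):
    return 0 if a == b else 4

def gap(k):
    return 9 + k

distance, top, bottom = wsb_align("COMPLEXITY", "FLEXIBILITY", d, gap, gap)
print(distance)
print(top)
print(bottom)
```

Keeping all minimising pointers instead of one would allow every optimal alignment to be enumerated, at the cost of extra storage.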

Example 3.3.30. The path shown in bold in Table 3.1 corresponds to the following alignment:

    COMPLEX----ITY
    ---FLEXIBILITY.

Note that there exists a second optimal path in this case; it corresponds to the alignment in Example 3.3.18.

The correspondence between alignments and paths in the dynamic programming table suggests an alternative definition of a distance. Let $u, v \in \Sigma^*$ and suppose $d$ is a non-negative function $\Sigma \times \Sigma \to \mathbb{R}_+$ such that $d(a, a) = 0$ and $g, h$ are positive functions. Define

\[
\rho(u, v) = \min_{\text{alignments of } u \text{ and } v} \Big\{ \sum_{a \in \Sigma} \sum_{b \in \Sigma} M_{a,b} \cdot d(a, b) + \sum_{k} I_k \cdot g(k) + \sum_{k} D_k \cdot h(k) \Big\},
\]

where, as in Lemma 3.3.17, $M_{a,b} = |\{i : u_i = a \wedge v_i = b\}|$, $I_k = |\{i : u_i = e \wedge |v_i| = k\}|$ and $D_k = |\{i : v_i = e \wedge |u_i| = k\}|$. The condition N is a sufficient condition for $\rho$ to be a quasi-metric.



3.4 Global Similarity 

An alternative approach to sequence comparison is to maximise similarities instead of minimising distances. In this case a similarity measure on $\Sigma$ and gap penalties are used to define the global similarity between two sequences in $\Sigma^*$. The computation is handled using the Needleman-Wunsch dynamic programming algorithm [146], which is very similar to the W-S-B algorithm for computation of distances. We define global similarity using a dynamic programming matrix.

Definition 3.4.1. Let $\Sigma$ be a set, $x, y \in \Sigma^*$, $s : \Sigma \times \Sigma \to \mathbb{R}$ and $g, h : \mathbb{N}^+ \to \mathbb{R}_+$. Let $m = |x|$ and $n = |y|$. The Needleman-Wunsch dynamic programming matrix, denoted $\mathrm{NW}(x, y, s, g, h)$, is an $(m + 1) \times (n + 1)$ matrix $S$ with rows and columns indexed from 0 such that $S_{0,0} = 0$, $S_{i,0} = \max_{1 \le k \le i} \{S_{i-k,0} - h(k)\}$, $S_{0,j} = \max_{1 \le k \le j} \{S_{0,j-k} - g(k)\}$ and for all $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$,

\[
S_{i,j} = \max \Big\{ S_{i-1,j-1} + s(x_i, y_j),\; \max_{1 \le k \le i} \{S_{i-k,j} - h(k)\},\; \max_{1 \le k \le j} \{S_{i,j-k} - g(k)\} \Big\}.
\]

We define the global similarity between the sequences $x$ and $y$ (given $s$, $g$, and $h$), denoted $\mathbb{S}(x, y)$, to be the value $S_{m,n}$. ▲
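The definition translates directly into code. The sketch below is illustrative only: the function name `global_similarity` and the scoring values in the usage lines (3 for a match, -1 for a mismatch, gap penalty $9 + k$) are assumptions chosen for the example, not values prescribed by the definition.

```python
def global_similarity(x, y, s, g, h):
    """Value S[m][n] of the Needleman-Wunsch matrix for x and y."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # first column: x[:i] aligned with gaps
        S[i][0] = max(S[i - k][0] - h(k) for k in range(1, i + 1))
    for j in range(1, n + 1):            # first row: y[:j] aligned with gaps
        S[0][j] = max(S[0][j - k] - g(k) for k in range(1, j + 1))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                max(S[i - k][j] - h(k) for k in range(1, i + 1)),
                max(S[i][j - k] - g(k) for k in range(1, j + 1)),
            )
    return S[m][n]

def score(a, b):
    return 3 if a == b else -1           # hypothetical similarity measure

def gap(k):
    return 9 + k                         # hypothetical gap penalty

print(global_similarity("FLEX", "FLEX", score, gap, gap))  # 12
print(global_similarity("A", "B", score, gap, gap))        # -1
```

Compared with the distance computation, only two things change: minima become maxima and gap penalties are subtracted rather than added.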
Remark 3.4.2. In terms of alignments, we have

\[
\mathbb{S}(x, y) = \max_{\text{alignments of } x \text{ and } y} \Big\{ \sum_{a \in \Sigma} \sum_{b \in \Sigma} M_{a,b} \cdot s(a, b) - \sum_{k} I_k \cdot g(k) - \sum_{k} D_k \cdot h(k) \Big\},
\]

where, as before, $M_{a,b} = |\{i : u_i = a \wedge v_i = b\}|$, $I_k = |\{i : u_i = e \wedge |v_i| = k\}|$ and $D_k = |\{i : v_i = e \wedge |u_i| = k\}|$. The term global is used because the alignments in question are global; in the next section we will examine local similarities, which involve local alignments.

Remark 3.4.3. Traditionally the gap penalty is a positive function in the case of 
both distances and similarities, being added in one case and subtracted in the other. 
The running times of dynamic programming algorithms still depend on the types 
of gap penalties, as discussed in the section about distances. 

It is also possible to interpret similarities by considering sets of weighted transformations similar to those used to define the W-S-B distance. In this case, the set $\tau$ still consists of weighted transformations of the elements of $\Sigma^*$ but the requirement that $W(T) = 0 \iff T = I$ is dropped. In particular, this means that each transformation of the form $T_{(a,a)}$, where $a \in \Sigma$, does not need to have weight 0 and that the weights of $T_{(a,a)}$ and $T_{(b,b)}$ may be different for different $a, b \in \Sigma$. It may be desirable to impose as an additional condition that $W(T_{(a,a)}) > W(T_{(a,b)})$ for all $a \ne b$. The definition of $\{u \to v\}_\tau$ remains as before and the similarity $S$ of two words $u$ and $v$ is defined to be

\[
S(u, v) = \max \Big\{ \sum_{i=1}^{m} W(T_i) : T_m, T_{m-1}, \ldots, T_1 \in \{u \to v\}_\tau \Big\}.
\]

For this definition to be equivalent to the one obtained from the Needleman-Wunsch algorithm, it is necessary that a condition similar to the condition M is fulfilled: there must be at least one optimal sequence of transformations which corresponds to a sequence of transformations considered by the Needleman-Wunsch algorithm. This is not always the case in practice (see Section 3.6 below) and one then needs to assume in addition that only those transformations acting on each alignment position only once are allowed.
3.4.1 Correspondence to distances

The following observation allows conversion of similarity scores to quasi-metrics.

Lemma 3.4.4 ([181]). Let $X$ be a set and $s : X \times X \to \mathbb{R}$ a map such that

(i) $s(x, x) \ge 0 \quad \forall x \in X$,

(ii) $s(x, x) \ge s(x, y) \quad \forall x, y \in X$,

(iii) $s(x, y) = s(x, x) \wedge s(y, x) = s(y, y) \implies x = y \quad \forall x, y \in X$,

(iv) $s(x, y) + s(y, z) \le s(x, z) + s(y, y) \quad \forall x, y, z \in X$.

Then $d : X \times X \to \mathbb{R}$ where $(x, y) \mapsto s(x, x) - s(x, y)$ is a quasi-metric. Furthermore, if $s$ is symmetric, that is, $s(x, y) = s(y, x)$ for all $x, y \in X$, $(X, d)$ is a co-weighted quasi-metric space with the co-weight $w : x \mapsto s(x, x)$.

Proof. Positivity of $d$ is equivalent to (ii), separation of points is equivalent to (iii) while the triangle inequality is equivalent to (iv). If $s(x, y) = s(y, x)$ then $d^*(x, y) + s(x, x) = s(y, y) - s(x, y) + s(x, x) = s(x, x) - s(x, y) + s(y, y) = d^*(y, x) + s(y, y)$ and since $s(x, x) \ge 0$ it follows that $w : x \mapsto s(x, x)$ is a co-weight. □
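For a finite set the four conditions of the lemma can be tested exhaustively, and the conversion to a quasi-metric is a one-liner. The sketch below is a hedged illustration: the helper names and the toy similarity on three letters (self-similarities 4, 5, 6 and value 1 between distinct letters) are hypothetical and were chosen only so that the conditions hold and the resulting distance is visibly asymmetric.

```python
from itertools import product

def satisfies_conditions(points, s):
    """Check conditions (i)-(iv) of the lemma on a finite set of points."""
    if any(s(x, x) < 0 for x in points):                                  # (i)
        return False
    for x, y in product(points, repeat=2):
        if s(x, x) < s(x, y):                                             # (ii)
            return False
        if x != y and s(x, y) == s(x, x) and s(y, x) == s(y, y):          # (iii)
            return False
    return all(s(x, y) + s(y, z) <= s(x, z) + s(y, y)                     # (iv)
               for x, y, z in product(points, repeat=3))

def to_quasi_metric(s):
    """The map (x, y) -> s(x, x) - s(x, y) from the lemma."""
    return lambda x, y: s(x, x) - s(x, y)

# Hypothetical symmetric similarity on a three-letter alphabet.
SELF = {"A": 4, "B": 5, "C": 6}
points = list(SELF)

def s(x, y):
    return SELF[x] if x == y else 1

d = to_quasi_metric(s)
print(satisfies_conditions(points, s))   # True
print(d("A", "B"), d("B", "A"))          # 3 4
```

The asymmetry $d(A, B) = 3 \ne 4 = d(B, A)$ arises purely from the differing self-similarities, which is exactly what the co-weight $x \mapsto s(x, x)$ accounts for.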

Obviously, if $s$ satisfies all the requirements of Lemma 3.4.4 and is symmetric, then $-s$ is a partial metric (Subsection 2.6.3) and Lemma 3.4.4 is equivalent to Theorem 2.6.15.

Lemma 3.4.5. Let $\Sigma$ be a set and $x \in \Sigma^*$. If $s : \Sigma \times \Sigma \to \mathbb{R}$ is a map satisfying the conditions (i) and (ii) of Lemma 3.4.4, $g$ and $h$ are functions $\mathbb{N}^+ \to \mathbb{R}_+$ and $S = \mathrm{NW}(x, x, s, g, h)$, then for all $i = 0, 1, \ldots, |x|$ and for all $j \le i$,

\[ S_{i,i} \ge S_{i,j} \quad\text{and}\quad S_{i,i} \ge S_{j,i}. \]

Proof. We prove our claim by induction. Let $\prec$ denote a partial order on $\mathbb{N} \times \mathbb{N}$ where $(i_0, j_0) \prec (i, j)$ if $i_0 < i$ or $i_0 = i$ and $j_0 < j$ (lexicographic order). The relation $\prec$ is well-founded of order type $\omega^2$ (but of course the induction is finite) and our claim is trivially true for $(0, 0)$. Assume it is true for all $(i', j') \prec (i, j)$.

If $i > 0$ and $j = 0$, we have for some $1 \le k \le i$, $S_{i,0} = S_{i-k,0} - h(k) \le S_{i,i}$ since $S_{i-k,0} \le S_{i,i}$ by the induction hypothesis and $h$ is non-negative. In a similar way, it follows that $S_{i,i} \ge S_{0,i}$ since $g$ is non-negative.

We now consider the case where $i > 0$ and $0 < j < i$ and show that $S_{i,i} \ge S_{i,j}$. If $S_{i,j} = S_{i-1,j-1} + s(x_i, x_j)$ we have $S_{i-1,j-1} \le S_{i-1,i-1}$ by the induction hypothesis and $s(x_i, x_j) \le s(x_i, x_i)$ by the condition (ii), and therefore $S_{i,i} \ge S_{i,j}$. If $S_{i,j} = S_{i-k,j} - h(k)$ for some $1 \le k \le i$, the result follows since $h$ is a non-negative function and $S_{i-k,j} \le S_{i,i}$ by the induction hypothesis. If $S_{i,j} = S_{i,j-k} - g(k)$, the same result follows by the induction hypothesis and non-negativity of $g$. The inequality $S_{i,i} \ge S_{j,i}$ follows by the same argument. □

Corollary 3.4.6. Suppose $s : \Sigma \times \Sigma \to \mathbb{R}$ is a function satisfying the conditions (i) and (ii) of Lemma 3.4.4, $g$ and $h$ are functions $\mathbb{N}^+ \to \mathbb{R}_+$ and $\mathbb{S}$ the global similarity on $\Sigma^*$ with respect to $s$, $g$ and $h$. Then, for all $x \in \Sigma^*$,

\[ \mathbb{S}(x, x) = \sum_{i=1}^{|x|} s(x_i, x_i). \]

Proof. Let $x \in \Sigma^*$. If $x = e$, by definition $\mathbb{S}(x, x) = 0$, coinciding with a sum over an empty set. For $x \in \Sigma^+$, Lemma 3.4.5 directly implies the required result. □


Theorem 3.4.7. Suppose $s : \Sigma \times \Sigma \to \mathbb{R}$ is a map satisfying the conditions of Lemma 3.4.4 and let $g, h$ be increasing functions $\mathbb{N}^+ \to \mathbb{R}_+$. Then, the formula

\[ \rho(x, y) = \mathbb{S}(x, x) - \mathbb{S}(x, y), \]

where $x, y \in \Sigma^*$ and $\mathbb{S}$ is the global similarity (given $s$, $g$ and $h$), defines a quasi-metric $\rho$ on $\Sigma^*$.

Proof. Set $d(a, b) = s(a, a) - s(a, b)$. By Lemma 3.4.4, $d$ is a co-weightable quasi-metric with co-weight $s(a, a)$. Lemma 2.6.7 implies that a co-weight function is left 1-Lipschitz. Consider the set $(\tau_\Sigma, w)$ of edit operations over $\Sigma^*$ where $w(T_{(a,b)}) = d(a, b)$, $w(T_{v+}) = g(|v|)$ and $w(T_{v-}) = h(|v|) + \mathbb{S}(v, v) = h(|v|) + \sum_{i=1}^{|v|} s(v_i, v_i)$. Let $\rho = \rho_{\tau_\Sigma,w}$. By our assumptions, $(\tau_\Sigma, w)$ satisfies the condition N and hence, by Theorem 3.3.27, the condition M. By Theorem 3.3.21 we have $\rho(\bar{x}_0, \bar{y}_0) = 0$, $\rho(\bar{x}_0, \bar{y}_j) = \min_{1 \le k \le j} \{\rho(\bar{x}_0, \bar{y}_{j-k}) + g(k)\}$, $\rho(\bar{x}_i, \bar{y}_0) = \min_{1 \le k \le i} \{\rho(\bar{x}_{i-k}, \bar{y}_0) + h(k) + \mathbb{S}(x_{i-k+1} \cdots x_i, x_{i-k+1} \cdots x_i)\}$, and for all $1 \le i \le |x|$, $1 \le j \le |y|$,

\[
\rho(\bar{x}_i, \bar{y}_j) = \min \Big\{ \rho(\bar{x}_{i-1}, \bar{y}_{j-1}) + s(x_i, x_i) - s(x_i, y_j),\;
\min_{1 \le k \le j} \{\rho(\bar{x}_i, \bar{y}_{j-k}) + g(k)\},\;
\min_{1 \le k \le i} \{\rho(\bar{x}_{i-k}, \bar{y}_j) + h(k) + \mathbb{S}(x_{i-k+1} \cdots x_i, x_{i-k+1} \cdots x_i)\} \Big\}.
\]

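We claim that for all $0 \le i \le |x|$, $0 \le j \le |y|$, $\rho(\bar{x}_i, \bar{y}_j) = \mathbb{S}(\bar{x}_i, \bar{x}_i) - S_{i,j}$, where $S = \mathrm{NW}(x, y, s, g, h)$.

It is clear that $\rho(\bar{x}_0, \bar{y}_0) = S_{0,0}$ and that $\rho(\bar{x}_i, \bar{y}_0) = \mathbb{S}(\bar{x}_i, \bar{x}_i) - S_{i,0}$. By Corollary 3.4.6, $\mathbb{S}(\bar{x}_0, \bar{x}_0) = \mathbb{S}(e, e) = 0$ and hence $\rho(\bar{x}_0, \bar{y}_j) = \mathbb{S}(\bar{x}_0, \bar{x}_0) - S_{0,j}$. Let $0 < i' \le m$, $0 < j' \le n$ and assume $\rho(\bar{x}_i, \bar{y}_j) = \mathbb{S}(\bar{x}_i, \bar{x}_i) - S_{i,j}$ for all $(i, j)$ such that $0 \le i \le i'$ and $0 \le j \le j'$ but excluding $(i', j')$. Then,

\[
\begin{aligned}
\rho(\bar{x}_{i'}, \bar{y}_{j'}) &= \min \Big\{ \mathbb{S}(\bar{x}_{i'-1}, \bar{x}_{i'-1}) - S_{i'-1,j'-1} + s(x_{i'}, x_{i'}) - s(x_{i'}, y_{j'}),\\
&\qquad \min_{1 \le k \le j'} \{\mathbb{S}(\bar{x}_{i'}, \bar{x}_{i'}) - S_{i',j'-k} + g(k)\},\\
&\qquad \min_{1 \le k \le i'} \{\mathbb{S}(\bar{x}_{i'-k}, \bar{x}_{i'-k}) - S_{i'-k,j'} + h(k) + \mathbb{S}(x_{i'-k+1} \cdots x_{i'}, x_{i'-k+1} \cdots x_{i'})\} \Big\}\\
&= \min \Big\{ \mathbb{S}(\bar{x}_{i'}, \bar{x}_{i'}) - S_{i'-1,j'-1} - s(x_{i'}, y_{j'}),\\
&\qquad \min_{1 \le k \le j'} \{\mathbb{S}(\bar{x}_{i'}, \bar{x}_{i'}) - S_{i',j'-k} + g(k)\},\;
\min_{1 \le k \le i'} \{\mathbb{S}(\bar{x}_{i'}, \bar{x}_{i'}) - S_{i'-k,j'} + h(k)\} \Big\}\\
&= \mathbb{S}(\bar{x}_{i'}, \bar{x}_{i'}) - \max \Big\{ S_{i'-1,j'-1} + s(x_{i'}, y_{j'}),\;
\max_{1 \le k \le j'} \{S_{i',j'-k} - g(k)\},\;
\max_{1 \le k \le i'} \{S_{i'-k,j'} - h(k)\} \Big\}\\
&= \mathbb{S}(\bar{x}_{i'}, \bar{x}_{i'}) - S_{i',j'},
\end{aligned}
\]

and our claim follows by induction. In particular, $\rho(x, y) = \mathbb{S}(\bar{x}_m, \bar{x}_m) - S_{m,n} = \mathbb{S}(x, x) - \mathbb{S}(x, y)$ as required. □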



Example 3.4.8. It is well known [183] that the longest common subsequence problem can be approached using similarities rather than distances. Let $\Sigma$ be a set and set for all $a, b \in \Sigma$, $s(a, a) = 1$ and $s(a, b) = 0$ if $a \ne b$. Let $g(k) = h(k) = 0$ for all $k \in \mathbb{N}^+$. It is easy to confirm that for $x, y \in \Sigma^*$, $\mathbb{S}(x, y) = |LCS(x, y)|$.

By Theorem 3.4.7, $d(x, y) = \mathbb{S}(x, x) - \mathbb{S}(x, y) = |x| - |LCS(x, y)|$ gives a co-weightable quasi-metric with co-weight $|\cdot|$. The associated metric is the metric $\rho_{LCS}$ from Example 3.3.10. The associated order $\le_d$ is clearly the subsequence order:

\[ x \le_d y \iff x \text{ is a subsequence of } y, \]

and $(\Sigma^*, \le_d)$ forms a meet semilattice where $x \sqcap y = LCS(x, y)$.

The partial order $(\Sigma^*, \le_d)$ is an example of an invariant meet semilattice (Definition 2.6.19) since

\[ d(x \sqcap z, y \sqcap z) = |x \sqcap z| - |x \sqcap y \sqcap z| \le d(x \sqcap z, x) + d(x, y) = d(x, y). \]

By Theorem 2.6.20, the map $f = |\cdot|$ is a meet valuation and $d(x, y) = f(x) - f(x \sqcap y)$.
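The quasi-metric of this example is easy to experiment with. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the standard longest-common-subsequence recurrence is used in place of the full Needleman-Wunsch recurrence. This substitution relies on the observation (made here, not in the text) that with zero gap penalties the table entries are non-decreasing along rows and columns, so the maxima over $k$ reduce to the neighbouring entries.

```python
from itertools import product

def lcs_length(x, y):
    """Length of a longest common subsequence of x and y."""
    L = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            L[i][j] = max(L[i - 1][j - 1] + (x[i - 1] == y[j - 1]),  # match scores 1
                          L[i - 1][j],                               # free deletion
                          L[i][j - 1])                               # free insertion
    return L[-1][-1]

def d(x, y):
    """d(x, y) = |x| - |LCS(x, y)|: letters of x that must be dropped to fit into y."""
    return len(x) - lcs_length(x, y)

print(d("ACE", "ABCDE"), d("ABCDE", "ACE"))                      # 0 2
print(d("COMPLEXITY", "FLEXIBILITY"), d("FLEXIBILITY", "COMPLEXITY"))  # 4 5
```

The first line of output shows the associated order at work: `d("ACE", "ABCDE")` is 0 precisely because ACE is a subsequence of ABCDE, while the reverse distance is positive.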
3.5 Local Similarity

Presently, most biological sequence comparison is done using local rather than 
global similarity measures. The principal reason is that elements of biological 
function whose detection is desired are usually restricted to discrete fragments of 
sequences and the strong similarity of fragments of two sequences may not extend 
to similarity of full sequences. For example, the structure of a protein consists 
of discrete structural domains interspersed with random coils linking them and 
variation is much higher in the parts not directly related to the function. Thus, even 
relatively closely related protein sequences may show little similarity outside the 
functionally important regions and their global similarity may not be significant. 

A similar phenomenon occurs in DNA sequences, where events other than point mutations and insertions and deletions, such as inversions or translocations, may occur between very closely related sequences. Therefore, local similarity measures, and the associated local alignments between two sequences, are most appropriate for general comparison of biological sequences. A dynamic programming algorithm for computation of local similarities, of the same complexity as the Needleman-Wunsch algorithm, was proposed by Smith and Waterman in 1981 [177]. While its cubic (quadratic if gap penalties are affine) complexity renders it not very suitable for sequential searches of large datasets, it remains the canonical yardstick with which the accuracy of any heuristic algorithm is assessed. We therefore follow the precedent of the previous section and define local similarity between two sequences using a dynamic programming matrix.

Definition 3.5.1. Let $\Sigma$ be a set, $x, y \in \Sigma^*$, $s : \Sigma \times \Sigma \to \mathbb{R}$ and $g, h : \mathbb{N}^+ \to \mathbb{R}_+$. Let $m = |x|$ and $n = |y|$. The Smith-Waterman dynamic programming matrix, denoted $\mathrm{SW}(x, y, s, g, h)$, is an $(m + 1) \times (n + 1)$ matrix $H$ with rows and columns indexed from 0 such that $H_{0,0} = H_{i,0} = H_{0,j} = 0$ and for all $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$,

\[
H_{i,j} = \max \Big\{ 0,\; H_{i-1,j-1} + s(x_i, y_j),\; \max_{1 \le k \le i} \{H_{i-k,j} - h(k)\},\; \max_{1 \le k \le j} \{H_{i,j-k} - g(k)\} \Big\}.
\]

We define the local similarity between the sequences $x$ and $y$ (given $s$, $g$, and $h$), denoted $\mathbb{H}(x, y)$, to be the largest entry of $H$, that is, $\mathbb{H}(x, y) = \max_{i,j} H_{i,j}$. ▲
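A direct transcription of this definition is sketched below. It is illustrative only: the function name `local_similarity` and the decision to also return the cell where the maximum is first attained are assumptions made here. The scoring in the usage lines is $s(a, a) = 3$, $s(a, b) = -1$ for $a \ne b$ and $g(k) = h(k) = 9 + k$.

```python
def local_similarity(x, y, s, g, h):
    """Return (largest entry of the Smith-Waterman matrix, cell where it first occurs)."""
    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 and column 0 stay at 0
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                0,                                                  # restart the alignment
                H[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                max(H[i - k][j] - h(k) for k in range(1, i + 1)),
                max(H[i][j - k] - g(k) for k in range(1, j + 1)),
            )
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return best, best_cell

def score(a, b):
    return 3 if a == b else -1

def gap(k):
    return 9 + k

print(local_similarity("COMPLEXITY", "FLEXIBILITY", score, gap, gap))  # (12, (8, 5))
```

The maximum is reached at cell (8, 5), the end of the common substring LEXI in both words; a traceback would start from that cell.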

An optimal edit script and a corresponding alignment are retrieved from $H$ by a slightly modified traceback procedure: the traceback starts at $(i, j)$ such that $H_{i,j}$ is maximal and ends at an entry of $H$ with a value of 0 (Example 3.5.2). Clearly, no traceback is possible if $H = 0$.

Two additional requirements are usually associated with the Smith-Waterman algorithm: the expected value of $s$ must be negative and at least for some $a, b \in \Sigma$, $s(a, b)$ must be positive. The first requirement obviously requires a probability measure on $\Sigma$ and exists to ensure that the alignments retrieved are indeed local rather than global or close to global. The second requirement ensures that pairs of sequences with a positive local similarity score exist.

Example 3.5.2. Consider the English words $u$ = COMPLEXITY and $v$ = FLEXIBILITY from the earlier examples (Examples 3.3.18 and 3.3.29). Suppose $s(a, a) = 3$, $s(a, b) = -1$ if $a \ne b$ and let $g(k) = h(k) = 9 + k$. The matrix $H = \mathrm{SW}(u, v, s, g, h)$ is given in Table 3.2. The local similarity score is 12; the corresponding alignment is the exact match of the common substring LEXI.

The local similarity between two words as defined using the Smith-Waterman algorithm can be realised as a global similarity between some of their fragments (provided there exist two fragments with positive global similarity). Recall that we use $\mathcal{F}(x)$ to denote the set of all factors (or fragments) of $x \in \Sigma^*$.

Lemma 3.5.3. Let $\Sigma$ be a set, $x, y \in \Sigma^*$, $s : \Sigma \times \Sigma \to \mathbb{R}$ and $g, h : \mathbb{N}^+ \to \mathbb{R}_+$. Suppose $\mathbb{H}(x, y) > 0$. Then there exist $x' \in \mathcal{F}(x)$ and $y' \in \mathcal{F}(y)$ such that $\mathbb{H}(x, y) = \mathbb{S}(x', y')$, where both global and local similarities are taken with respect to $s$, $g$ and $h$.

Proof. Since $\mathbb{H}(x, y) > 0$, it follows that $x, y \in \Sigma^+$. We find $x' \in \mathcal{F}(x)$, $y' \in \mathcal{F}(y)$ by traceback. Let $H = \mathrm{SW}(x, y, s, g, h)$. By definition of local similarity there exist $i_0, j_0 \ne 0$ such that $\mathbb{H}(x, y) = H_{i_0,j_0} > 0$. We trace back the path of cells of the Smith-Waterman dynamic programming matrix from $(i_0, j_0)$ to a zero entry by constructing a sequence $\{(i_k, j_k)\}_{k=0}^{m}$ such that $H_{i_0,j_0} = \mathbb{H}(x, y)$, $H_{i_m,j_m} = 0$ and $i_{k+1} \le i_k$, $j_{k+1} \le j_k$ in the following way. For each $k$, if $H_{i_k,j_k} = 0$, stop. Otherwise, if $H_{i_k,j_k} = H_{i_k - 1,j_k - 1} + s(x_{i_k}, y_{j_k})$, set $(i_{k+1}, j_{k+1}) = (i_k - 1, j_k - 1)$; if $H_{i_k,j_k} = H_{i_k,j_k - l} - g(l)$, set $(i_{k+1}, j_{k+1}) = (i_k, j_k - l)$; if $H_{i_k,j_k} = H_{i_k - l,j_k} - h(l)$, set $(i_{k+1}, j_{k+1}) = (i_k - l, j_k)$. Such a sequence always exists since $H_{i_0,j_0} > 0$. Furthermore, since $g$ and $h$ are non-negative, it follows that $i_m < i_0$ and $j_m < j_0$. Let $x' = x_{i_m + 1} x_{i_m + 2} \cdots x_{i_0}$, $y' = y_{j_m + 1} y_{j_m + 2} \cdots y_{j_0}$ and $S = \mathrm{NW}(x', y', s, g, h)$. Comparing the definitions of global and local similarities, it is easy to see that $S_{|x'|,|y'|} = H_{i_0,j_0}$. □

Table 3.2: The dynamic programming table used to compute the Smith-Waterman local similarity between the strings COMPLEXITY and FLEXIBILITY. The path recovering the optimal alignment is shown in bold.

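Corollary 3.5.4. Let $\Sigma$ be a set, $x, y \in \Sigma^*$, $s : \Sigma \times \Sigma \to \mathbb{R}$ and $g, h : \mathbb{N}^+ \to \mathbb{R}_+$. Then

\[ \mathbb{H}(x, y) = \max_{x' \in \mathcal{F}(x),\, y' \in \mathcal{F}(y)} \mathbb{S}(x', y') \vee 0. \]

Proof. Let $H = \mathrm{SW}(x, y, s, g, h)$ and $S = \mathrm{NW}(x, y, s, g, h)$. It can be easily verified from the definitions (for example by induction) that for all $i, j$, $H_{i,j} \ge S_{i,j}$ and therefore for all $x' \in \mathcal{F}(x)$, $y' \in \mathcal{F}(y)$, $\mathbb{H}(x, y) \ge \mathbb{H}(x', y') \ge \mathbb{S}(x', y')$. If $\mathbb{H}(x, y) > 0$, Lemma 3.5.3 implies $\mathbb{H}(x, y) \le \max\{\mathbb{S}(x', y') \mid x' \in \mathcal{F}(x), y' \in \mathcal{F}(y)\}$. □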
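The corollary lends itself to a brute-force sanity check on short words. The sketch below is a hedged illustration: the single function `similarity` with a `local` flag is a design choice made here (it stresses that the two recurrences differ only in the boundary values and the extra 0), and the scoring values are hypothetical.

```python
def similarity(x, y, s, g, h, local=False):
    """Global similarity by default; local similarity if local=True."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    if not local:                        # the global boundary pays for leading gaps
        for i in range(1, m + 1):
            T[i][0] = max(T[i - k][0] - h(k) for k in range(1, i + 1))
        for j in range(1, n + 1):
            T[0][j] = max(T[0][j - k] - g(k) for k in range(1, j + 1))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            candidates = [T[i - 1][j - 1] + s(x[i - 1], y[j - 1])]
            candidates += [T[i - k][j] - h(k) for k in range(1, i + 1)]
            candidates += [T[i][j - k] - g(k) for k in range(1, j + 1)]
            if local:
                candidates.append(0)     # a local alignment may restart anywhere
            T[i][j] = max(candidates)
    return max(max(row) for row in T) if local else T[m][n]

def factors(x):
    """All factors (contiguous fragments) of x, including the empty word."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}

def brute_force_local(x, y, s, g, h):
    """Right-hand side of the corollary: best global similarity over all factor pairs, or 0."""
    return max(0, max(similarity(a, b, s, g, h)
                      for a in factors(x) for b in factors(y)))

def score(a, b):
    return 3 if a == b else -1

def gap(k):
    return 9 + k

print(similarity("COMPLEX", "FLEXI", score, gap, gap, local=True))   # 9
print(brute_force_local("COMPLEX", "FLEXI", score, gap, gap))        # 9
```

The brute-force side is far more expensive than the dynamic programming side, so this is useful only as a check on small inputs.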

We now present the main result of this chapter which gives the conditions for 
conversion of local similarity scores on a free semigroup to a quasi-metric. We 
first introduce a necessary technical condition. 

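Theorem 3.5.5. Let $\Sigma$ be a set and $f$ a strictly positive function $\Sigma \to \mathbb{R}$. Let $\rho$ be a metric on $\Sigma^*$ and let $f$ also denote the canonical homomorphic extension of $f$ to the free semigroup $\Sigma^*$ given by $f(x) = \sum_{i=1}^{|x|} f(x_i)$ for all $x \in \Sigma^+$ and $f(e) = 0$. Suppose that for all $x, y \in \Sigma^*$,

\[ |f(x) - f(y)| \le \rho(x, y) \le f(x) + f(y), \tag{3.8} \]

and

\[ f(x) - f(y) = \rho(x, y) \iff y \in \mathcal{F}(x), \tag{3.9} \]

then $d : \Sigma^* \times \Sigma^* \to \mathbb{R}$ defined by

\[ d(x, y) = f(x) - \frac{1}{2} \max_{\tilde{x} \in \mathcal{F}(x),\, \tilde{y} \in \mathcal{F}(y)} \{ f(\tilde{x}) + f(\tilde{y}) - \rho(\tilde{x}, \tilde{y}) \} \]

is a co-weightable quasi-metric with co-weight $f$.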

Proof. Let $x, y \in \Sigma^*$. Since $f(x) \ge f(\tilde{x})$ for any $\tilde{x} \in \mathcal{F}(x)$ and since (3.8) implies that $f$ is 1-Lipschitz, it follows that $d(x, y) \ge 0$. It is also clear that $d(x, x) = 0$. If $d(x, y) = 0$, there exist $\tilde{x} \in \mathcal{F}(x)$ and $\tilde{y} \in \mathcal{F}(y)$ such that

\[ f(x) - \tfrac{1}{2} \big( f(\tilde{x}) + f(\tilde{y}) - \rho(\tilde{x}, \tilde{y}) \big) = 0. \tag{3.10} \]

Since $\tilde{x} \in \mathcal{F}(x)$, there exist $u, v \in \Sigma^*$ such that $x = u \tilde{x} v$ and Equation (3.10) becomes

\[ f(u) + f(v) + \tfrac{1}{2} \big( f(\tilde{x}) - f(\tilde{y}) + \rho(\tilde{x}, \tilde{y}) \big) = 0. \]

Since $f(u) \ge 0$, $f(v) \ge 0$ and $f(\tilde{x}) - f(\tilde{y}) + \rho(\tilde{x}, \tilde{y}) \ge 0$ ($f$ is 1-Lipschitz), it must follow that $f(u) = 0$, $f(v) = 0$ and

\[ f(\tilde{x}) - f(\tilde{y}) + \rho(\tilde{x}, \tilde{y}) = 0. \tag{3.11} \]

From $f(u) = 0$ and $f(v) = 0$ we conclude that $u = e$, $v = e$ and $x = \tilde{x}$, while (3.9) implies that $x = \tilde{x} \in \mathcal{F}(\tilde{y}) \subseteq \mathcal{F}(y)$. Hence, since the maximum in the definition of $d(x, y)$ is invariant under permutation of $x$ and $y$, it follows that $d(x, y) = d(y, x) = 0$ implies $x \in \mathcal{F}(y)$ and $y \in \mathcal{F}(x)$ and hence that $x = y$.

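Now let $x, y, z \in \Sigma^*$ and suppose $d(x, y) = f(x) - \frac{1}{2} \big( f(\tilde{x}) + f(\tilde{y}) - \rho(\tilde{x}, \tilde{y}) \big)$ and $d(y, z) = f(y) - \frac{1}{2} \big( f(\hat{y}) + f(\tilde{z}) - \rho(\hat{y}, \tilde{z}) \big)$ for some $\tilde{x} \in \mathcal{F}(x)$, $\tilde{y}, \hat{y} \in \mathcal{F}(y)$ and $\tilde{z} \in \mathcal{F}(z)$. Write out $\tilde{y} = y_i y_{i+1} \cdots y_{i+m-1}$, $\hat{y} = y_j y_{j+1} \cdots y_{j+n-1}$ where $m = |\tilde{y}|$, $n = |\hat{y}|$, $1 \le i \le i + m - 1 \le |y|$ and $1 \le j \le j + n - 1 \le |y|$.

If $\tilde{y}$ and $\hat{y}$ overlap, that is, if $i \le j < i + m$ or $j \le i < j + n$, let $y'$ denote the whole overlapping fragment (for example, if $i \le j \le i + m - 1 \le j + n - 1$, $y' = y_j y_{j+1} \cdots y_{i+m-1}$). If $\tilde{y}$ and $\hat{y}$ do not overlap or either $\tilde{y}$ or $\hat{y}$ is the identity, let $y' = e$. Since $y' \in \mathcal{F}(\tilde{y})$ and $y' \in \mathcal{F}(\hat{y})$, by the triangle inequality on $\rho$ and by (3.9), we have

\[
\begin{aligned}
\rho(\tilde{x}, \tilde{y}) &\ge \rho(\tilde{x}, y') - \rho(\tilde{y}, y') = \rho(\tilde{x}, y') + f(y') - f(\tilde{y}) \quad\text{and}\\
\rho(\hat{y}, \tilde{z}) &\ge \rho(y', \tilde{z}) - \rho(y', \hat{y}) = \rho(y', \tilde{z}) + f(y') - f(\hat{y}).
\end{aligned}
\]

Since $y'$ denotes the full extent of overlap of $\tilde{y}$ and $\hat{y}$, it follows that

\[ f(y) + f(y') - f(\tilde{y}) - f(\hat{y}) \ge 0 \]

and therefore

\[
\begin{aligned}
d(x, y) + d(y, z) &= f(x) - \tfrac{1}{2} \big( f(\tilde{x}) + f(\tilde{y}) - \rho(\tilde{x}, \tilde{y}) \big) + f(y) - \tfrac{1}{2} \big( f(\hat{y}) + f(\tilde{z}) - \rho(\hat{y}, \tilde{z}) \big)\\
&\ge f(x) - \tfrac{1}{2} \big( f(\tilde{x}) + 2 f(\tilde{y}) - f(y') - \rho(\tilde{x}, y') \big) + f(y) - \tfrac{1}{2} \big( 2 f(\hat{y}) + f(\tilde{z}) - f(y') - \rho(y', \tilde{z}) \big)\\
&= f(x) - \tfrac{1}{2} \big( f(\tilde{x}) + f(\tilde{z}) - \rho(\tilde{x}, y') - \rho(y', \tilde{z}) \big) + f(y) + f(y') - f(\tilde{y}) - f(\hat{y})\\
&\ge f(x) - \tfrac{1}{2} \big( f(\tilde{x}) + f(\tilde{z}) - \rho(\tilde{x}, \tilde{z}) \big)\\
&\ge d(x, z).
\end{aligned}
\]

The fact that $d$ is co-weightable with co-weight $f$ follows straight from the definition of $d$. □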

Remark 3.5.6. In general, the property (3.8) means that $\bar f$ can be interpreted as
a distance from an abstract point with respect to a metric on the set
$\Sigma^* \cup \{\ast\}$. Flood, in his PhD thesis and a follow-up paper [59],
introduced the term norm pair to denote a pair $(\rho, \bar f)$ satisfying the
property (3.8). However, in the context of Theorem 3.5.5 it is clear that
$\bar f(x) = \rho(x, e)$. Hence, the property (3.8) can be reformulated to state:
for all $x \in \Sigma^*$, $\rho(x, e)$ is given by a canonical homomorphic extension
of a strictly positive function on the set of generators.

The following Lemma 3.5.7 is a folklore result (see e.g. Flood's paper [59]),
but we present the proof for the sake of completeness and because we could not
find a reference that would be readily available to the reader.

Lemma 3.5.7 ([59]). Let $(X, d)$ be a metric space and $f : X \to \mathbb{R}_+$ a
positive 1-Lipschitz function. Then the map $\rho : X \times X \to \mathbb{R}_+$
defined by

\[ \rho(x, y) = \min\{ d(x, y),\ f(x) + f(y) \} \]

is a metric.

Proof. Let $x, y, z \in X$. Clearly $\rho(x, x) = 0$ and $\rho(x, y) = \rho(y, x)$.
Since $f$ is positive, $\rho(x, y) = 0 \implies d(x, y) = 0$ and hence $x = y$. For the
triangle inequality we consider four cases. If $\rho(x, y) = d(x, y)$ and
$\rho(y, z) = d(y, z)$, then $\rho(x, y) + \rho(y, z) \ge \rho(x, z)$ by the triangle
inequality of $d$. If $\rho(x, y) = d(x, y)$ and $\rho(y, z) = f(y) + f(z)$, then, since
$f$ is 1-Lipschitz, $\rho(x, y) + \rho(y, z) \ge f(x) + f(z) \ge \rho(x, z)$.
In the case where $\rho(x, y) = f(x) + f(y)$ and $\rho(y, z) = d(y, z)$ the result
follows in the same way. Finally, if $\rho(x, y) = f(x) + f(y)$ and
$\rho(y, z) = f(y) + f(z)$, we have
$\rho(x, y) + \rho(y, z) = f(x) + f(z) + 2 f(y) \ge \rho(x, z)$ since $f$ is positive. □
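
The statement of Lemma 3.5.7 can be checked by brute force on a small example. The
sketch below is illustrative only: the choice of the real line with the usual metric,
the function f(x) = 1/(1 + |x|) and the grid of sample points are all assumptions made
for the example, not taken from the text.

```python
from fractions import Fraction
from itertools import product


def d(x, y):
    """Usual metric on the real line."""
    return abs(x - y)


def f(x):
    """A positive function that is 1-Lipschitz with respect to d (illustrative)."""
    return Fraction(1) / (1 + abs(x))


def rho(x, y):
    """Candidate metric of Lemma 3.5.7: min{d(x, y), f(x) + f(y)}."""
    return min(d(x, y), f(x) + f(y))


def is_metric(dist, points):
    """Brute-force check of the metric axioms on a finite set of points."""
    for x, y in product(points, repeat=2):
        if dist(x, y) != dist(y, x):
            return False
        if (dist(x, y) == 0) != (x == y):
            return False
    return all(dist(x, z) <= dist(x, y) + dist(y, z)
               for x, y, z in product(points, repeat=3))


# Sample points -4, -3.5, ..., 4 as exact fractions (no rounding errors).
points = [Fraction(k, 2) for k in range(-8, 9)]
print(is_metric(rho, points))
```

Far-apart points such as -4 and 4 are the interesting case here: their usual distance
is 8, but rho caps it at f(-4) + f(4) = 2/5, so the minimum is genuinely active.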

Corollary 3.5.8. Let $\Sigma$ be a set. Suppose $g$ is an increasing function
$\mathbb{N}_+ \to \mathbb{R}$, $h = g$, and $s : \Sigma \times \Sigma \to \mathbb{R}$
is a map satisfying the conditions of Lemma 3.4.4 and being symmetric, that is,
$s(b, a) = s(a, b)$ for all $a, b \in \Sigma$. Let $\mathcal{H}$ be the local
similarity with respect to $s$, $g$ and $h$. Then the function
$d : \Sigma^* \times \Sigma^* \to \mathbb{R}_+$ given by

\[ d(x, y) = \mathcal{H}(x, x) - \mathcal{H}(x, y) \]

is a co-weightable quasi-metric with co-weight $x \mapsto \mathcal{H}(x, x)$
(equivalently, $-\mathcal{H}$ is a partial metric).

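Proof. Let $\mathcal{S}$ be the global similarity with respect to $s$, $g$ and $h$.
Clearly, $\mathcal{S}$ is symmetric since $s$ is symmetric and $g = h$. Let
$\rho_0(x, y) = \mathcal{S}(x, x) - \mathcal{S}(x, y)$ for $x, y \in \Sigma^*$ and let
$\mathcal{S}_0(x) = \mathcal{S}(x, x) = \sum_{i=1}^{|x|} s(x_i, x_i)$ (Corollary 3.4.6).
By Theorem 3.4.7, $\rho_0$ is a co-weighted quasi-metric with co-weight
$\mathcal{S}_0$ and therefore its symmetrisation

\[ \rho_0(x, y) + \rho_0(y, x) = \mathcal{S}(x, x) + \mathcal{S}(y, y) - \mathcal{S}(x, y) - \mathcal{S}(y, x) \]

is a metric, with respect to which $\mathcal{S}_0$ is 1-Lipschitz. By Lemma 3.5.7,

\[ \rho(x, y) = \min\{ \rho_0(x, y) + \rho_0(y, x),\ \mathcal{S}_0(x) + \mathcal{S}_0(y) \} \]

gives a metric.

It is easy to see that for all $x, y \in \Sigma^*$,

\[ \mathcal{S}(x, y) = \frac{1}{2} \left( \mathcal{S}_0(x) + \mathcal{S}_0(y) - \rho(x, y) \right), \]

and hence, by Corollary 3.5.4,

\[ \mathcal{H}(x, y) = \frac{1}{2} \max_{\tilde x \in \mathfrak{F}(x),\ \tilde y \in \mathfrak{F}(y)} \left\{ \mathcal{S}_0(\tilde x) + \mathcal{S}_0(\tilde y) - \rho(\tilde x, \tilde y) \right\}. \]

Furthermore, $\mathcal{H}(x, x) = \mathcal{S}(x, x)$ since $s(a, a) > 0$ for all
$a \in \Sigma$.

The main statement then follows from Theorem 3.5.5, and the remark about
$-\mathcal{H}$ being a partial metric follows from Theorem 2.6.15. □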
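
To make Corollary 3.5.8 concrete, the following sketch computes a local similarity with
the Smith-Waterman recurrence and checks, by brute force over all short words, that
d(x, y) = H(x, x) - H(x, y) behaves as a co-weightable quasi-metric. Everything
numerical here is an assumption made for illustration: a two-letter alphabet, match
score 2, mismatch score -1 and a linear gap penalty g(k) = 2k (an affine penalty with
zero opening cost). It is a sanity check on a toy instance, not a proof.

```python
from itertools import product

MATCH, MISMATCH, GAP = 2, -1, 2  # hypothetical toy parameters


def score(a, b):
    """Symmetric similarity score on letters."""
    return MATCH if a == b else MISMATCH


def local_similarity(x, y):
    """Smith-Waterman local score with linear gap penalty g(k) = GAP * k."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            h = max(0,
                    prev[j - 1] + score(a, b),  # align a with b
                    prev[j] - GAP,              # a aligned with a gap
                    cur[j - 1] - GAP)           # b aligned with a gap
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best


def d(x, y):
    """Candidate quasi-metric d(x, y) = H(x, x) - H(x, y)."""
    return local_similarity(x, x) - local_similarity(x, y)


def check_quasi_metric(words):
    """Brute-force check of the quasi-metric axioms and the co-weight identity."""
    w = {x: local_similarity(x, x) for x in words}
    dist = {(x, y): d(x, y) for x, y in product(words, repeat=2)}
    for x, y in product(words, repeat=2):
        if dist[x, y] < 0:
            return False
        if x != y and dist[x, y] == 0 and dist[y, x] == 0:
            return False
        if dist[x, y] + w[y] != dist[y, x] + w[x]:  # co-weight identity
            return False
    return all(dist[x, z] <= dist[x, y] + dist[y, z]
               for x, y, z in product(words, repeat=3))


# All words of length at most 3 over the alphabet {a, b}, including the empty word.
words = ["".join(p) for n in range(4) for p in product("ab", repeat=n)]
print(check_quasi_metric(words))
print(d("ab", "b"), d("b", "ab"))
```

The last line illustrates the asymmetry: "b" is a factor of "ab", so d("b", "ab") is
zero, while d("ab", "b") is positive.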

Remark 3.5.9. An alternative treatment of the same problem is given in the Topology
Proc. paper by the thesis author. There, however, a different definition of
an alignment is given and the statement of the main theorem explicitly uses the
properties of score matrices and gap penalties. Theorem 3.5.5 is a more general
statement of the same fact.

It is clear from the proof of Theorem 3.5.5 that the partial order $\le_d$
associated to the quasi-metric $d$ of Corollary 3.5.8 is a substring (factor) order:

\[ x \le_d y \iff d(x, y) = 0 \iff x \in \mathfrak{F}(y). \]

The set $\Sigma^*$ with $\le_d$ forms a meet semilattice. However, in general, $d$ is
not invariant with respect to the concatenation or meet operation. For example, let
$\Sigma = \{a, b, c\}$ and for all $\sigma, \tau \in \Sigma$ set



Let $g(k) = h(k) = 10 + k$ and suppose $\mathcal{H}$ is a global similarity with
respect to $s$, $g$ and $h$. If $x = aabb$, $y = bbbc$ and $z = aabc$, it is easy to
verify that $x \sqcap z = aab$, $y \sqcap z = bc$, $d(x, y) = 2$ and
$d(x \sqcap z, y \sqcap z) = 3 > d(x, y)$, and hence $d$ is not invariant with respect
to $\sqcap$. On the other hand, if $x = aaab$, $y = aaa$ and $z = c$, we have
$d(x, y) = 1$ while $d(xz, yz) = 2$, and therefore $d$ is not invariant with respect to
string concatenation.



3.6 Score Matrices

The main result from the previous section indicates that, at least under some
circumstances, free semigroups with local similarity measures can be considered
as partial metric spaces or, equivalently, as co-weighted quasi-metric spaces. A
consequence of Theorem 2.6.15 of particular significance for biological applications
is the fact that the transformation into a quasi-metric preserves neighbourhoods
with respect to similarity scores.

Let $x \in \Sigma^*$ and define, for some $t > 0$,

\[ \mathcal{N}_t(x) = \{ y \in \Sigma^* : \mathcal{H}(x, y) \ge t \}, \]

that is, $\mathcal{N}_t(x)$ is the set of all points in $\Sigma^*$ whose local
similarity with $x$ is not less than $t$. Retrieving points belonging to such
neighbourhoods from datasets is the principal aim of similarity search, explored in
detail in Chapter 5. Corollary 3.5.8 implies that there exists a co-weightable
quasi-metric $d$ with co-weight $w$ such that

\[ \mathcal{N}_t(x) = \{ y \in \Sigma^* : d(x, y) \le w(x) - t \} \]

(i.e. the neighbourhoods $\mathcal{N}_t(x)$ for all $x$ and $t$ form a base for a
quasi-metrisable topology). Therefore, one can
expect that existing and newly developed indexing techniques for similarity search 
in (weightable) quasi-metric spaces (see Chapter 5) can be used to significantly
speed up sequence similarity searches without significant sacrifice in accuracy.
Furthermore, the result makes it worthwhile to repeat the exploration of global
geometry of proteins performed by Linial, Linial, Tishby and Yona [126], this
time in the context of quasi-metrics.

The current section explores the similarity measures (commonly called score
matrices for obvious reasons) on DNA and protein alphabets which satisfy
Lemma 3.4.4 and which hence, with affine gap penalties, lead to local similarities
corresponding to quasi-metrics. In particular, the most popular members of the
BLOSUM [88] family of matrices satisfy all the requirements of Lemma 3.4.4,
unlike the members of the PAM family [45], which do not and which are therefore
omitted from the discussion here.

3.6.1 DNA score matrices 

The DNA alphabet consists of only 4 letters (nucleotides) and the frequently used
similarity measures on it are very simple. The common feature of all general DNA
matrices used in practice is that they are symmetric and that self-similarities of all
nucleotides are equal. The consequence of this fact is that the distance $d$ resulting
from the transformation $d(a, b) = s(a, a) - s(a, b)$ is always a metric and the
co-weightable quasi-metric arising from local similarity on DNA sequences has
co-weight proportional to the length of a sequence.
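
A minimal sketch of this observation follows. The match and mismatch values (5 and -4)
are hypothetical placeholders chosen for illustration and are not taken from the text;
any symmetric matrix with equal self-similarities and a smaller mismatch score would
serve equally well.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
MATCH, MISMATCH = 5, -4  # hypothetical scores, for illustration only


def s(a, b):
    """Symmetric DNA score with equal self-similarities."""
    return MATCH if a == b else MISMATCH


def d(a, b):
    """Distance obtained from the similarity: d(a, b) = s(a, a) - s(a, b)."""
    return s(a, a) - s(a, b)


def co_weight(x):
    """Co-weight of a sequence: the sum of self-similarities of its letters."""
    return sum(s(a, a) for a in x)


pairs = list(product(NUCLEOTIDES, repeat=2))
triples = list(product(NUCLEOTIDES, repeat=3))

symmetric = all(d(a, b) == d(b, a) for a, b in pairs)
separating = all((d(a, b) == 0) == (a == b) for a, b in pairs)
triangle = all(d(a, c) <= d(a, b) + d(b, c) for a, b, c in triples)

print(symmetric and separating and triangle)   # d is a metric on the alphabet
print(co_weight("ACGT") == MATCH * len("ACGT"))  # co-weight grows with length
```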

For example, the score matrix used by BLAST (more precisely, the blastn
program, which searches a DNA database with a DNA query sequence) is given by



More complex score matrices, mostly distance-based and used in phylogenetics,
also exist.



3.6.2 BLOSUM matrices 

As the protein alphabet consists of 20 amino acids of markedly different chemical
properties and structural roles, it is to be expected that similarity measures
on amino acids involved in protein sequence comparison are more complex. The
BLOSUM family of matrices was constructed by Steven and Jorja Henikoff in
1992 [88], who also showed that one member of the family, the BLOSUM62 matrix,
gave the best search performance amongst all score matrices used at the time.
For that reason, the BLOSUM62 matrix is the default matrix used by NCBI BLAST
for searches of protein databases.

The BLOSUM similarity scores are explicitly constructed as log-odds ratios.
Let $\Sigma$ be a (finite) set and let $p$ be a probability measure on $\Sigma$. The
value of $p(a)$ is called the background frequency of $a \in \Sigma$. Let $q$ be a
probability measure on $\Sigma \times \Sigma$. The value of $q(a, b)$ is called the
target frequency of a match between $a$ and $b$, that is, the likelihood that $a$ is
aligned with $b$ in related sequences. For unrelated sequences, we expect that the
probability of $a$ being aligned with $b$ would be $p(a) p(b)$. The similarity score
$s(a, b)$ is defined (up to a scaling factor) by

\[ s(a, b) = \log \frac{q(a, b)}{p(a) p(b)}. \]

Thus, $s(a, b)$ is positive if the target frequencies are greater than the background
frequencies, zero if they are equal and negative if the background frequencies are
greater. In this model, condition (iv) of Lemma 3.4.4 (the triangle inequality of the
corresponding quasi-metric) is equivalent to

\[ q(a, b) q(b, c) \le q(a, c) q(b, b) \]

for all $a, b, c \in \Sigma$ and can be interpreted as stating that a direct
substitution of one letter for another at each site in the sequence is always
preferred to two or more substitutions achieving the same transformation. It should be
noted that according to Altschul [5], who studied the statistics of scores of ungapped
local alignments, any similarity score matrix can be interpreted as log-odds ratios
(i.e. target frequencies can be derived from similarity scores given the background
frequencies).
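
The log-odds construction and the equivalence of the two forms of the triangle
condition can be illustrated numerically. The two-letter alphabet and the target
frequencies below are made-up values for illustration; the background frequencies are
taken to be the marginals of q, which is one natural choice and an assumption here.

```python
import math
from itertools import product

ALPHABET = "ab"

# Hypothetical symmetric target frequencies q(a, b); they sum to 1.
q = {("a", "a"): 0.4, ("a", "b"): 0.1,
     ("b", "a"): 0.1, ("b", "b"): 0.4}

# Background frequencies p(a), here the marginals of q.
p = {a: sum(q[a, b] for b in ALPHABET) for a in ALPHABET}


def s(a, b):
    """Log-odds score s(a, b) = log(q(a, b) / (p(a) p(b)))."""
    return math.log(q[a, b] / (p[a] * p[b]))


def triangle_on_scores(a, b, c, tol=1e-12):
    """Condition on scores: s(a, b) + s(b, c) <= s(a, c) + s(b, b)."""
    return s(a, b) + s(b, c) <= s(a, c) + s(b, b) + tol


def triangle_on_frequencies(a, b, c, tol=1e-12):
    """Equivalent condition on target frequencies."""
    return q[a, b] * q[b, c] <= q[a, c] * q[b, b] + tol


agree = all(triangle_on_scores(a, b, c) == triangle_on_frequencies(a, b, c)
            for a, b, c in product(ALPHABET, repeat=3))
print(s("a", "a") > 0, s("a", "b") < 0, agree)
```

The background terms cancel because both sides of the score inequality carry the same
denominator p(a) p(b)^2 p(c), which is why only the target frequencies remain.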

The target frequencies used to obtain the BLOSUM scores were derived from
multiple alignments. A multiple alignment between $n$ sequences can be defined
in a similar way to a pairwise alignment between two sequences according to
Definition 3.3.12: it is only necessary to replace the sequence of pairs with a
sequence of $n$-tuples and to adjust the remainder of the definition accordingly. The
(ungapped) multiple alignments of related sequences (also called blocks) used to
construct the BLOSUM similarities were obtained from the BLOCKS database of
protein motifs of Henikoff and Henikoff [89].

In order to reduce the contribution of too closely related members of blocks
to target frequencies, members of blocks sharing at least $L\%$ identity were
clustered together and considered as one sequence (for a block member to belong to a
cluster, it was sufficient for it to share $L\%$ identity with one member of the
cluster), resulting in a family of matrices. Thus, the matrix BLOSUM62 corresponds
to $L = 62$ (for BLOSUMN, no clustering was performed). After clustering, the
target frequencies were obtained by counting the number of each pair of amino
acids in each column in each block having more than one cluster and normalising
by the total number of pairs. The background frequencies were obtained from the
amino acid composition of the clustered blocks and log-odds ratios taken. The
resulting score matrices are necessarily symmetric since the pair $(a, b)$ cannot be
distinguished from $(b, a)$ in the multiple alignment.

Most BLOSUM matrices, when restricted to the standard amino acid alphabet,
satisfy Lemma 3.4.4 (Table 3.3). In fact, the first three conditions are always
satisfied and only the triangle inequality presents problems. Where it is not
satisfied, it is either in a very small number of cases or for small values of $L$,
which correspond to alignments of distantly related proteins and where it is to be
expected that a transformation from one amino acid to another can arise from more
than one substitution. However, it should be stressed that BLOSUM50 and BLOSUM62,
which are the most widely used score matrices for database searches, do
satisfy Lemma 3.4.4.

Matrix      Failures    Matrix      Failures    Matrix       Failures
BLOSUM30    44          BLOSUM60    0           BLOSUM80     0
BLOSUM35    0           BLOSUM62    0           BLOSUM85     0
BLOSUM40    6           BLOSUM65    0           BLOSUM90     0
BLOSUM45    0           BLOSUM70    2           BLOSUM100    0
BLOSUM50    0           BLOSUM75    2           BLOSUMN      0
BLOSUM55    2

Table 3.3: Numbers of triples of amino acids failing the triangle inequality in the
BLOSUM family of score matrices. Note that all BLOSUM matrices are symmetric and thus
the number of independent triples is half the number reported. For BLOSUM55,
BLOSUM70 and BLOSUM75, the one independent triple failing consists of amino acids I, V
and A, that is, we have $s(I, V) + s(V, A) > s(I, A) + s(V, V)$.
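
Counting the failing triples for a given score matrix is a direct enumeration. The
sketch below shows one way to do it; the three-letter matrix is a made-up toy example
(not an actual BLOSUM matrix) constructed so that exactly one independent triple
fails, and the function and variable names are hypothetical.

```python
from itertools import product


def failing_triples(letters, s):
    """Ordered triples (a, b, c) with s(a, b) + s(b, c) > s(a, c) + s(b, b)."""
    return [(a, b, c) for a, b, c in product(letters, repeat=3)
            if s[a, b] + s[b, c] > s[a, c] + s[b, b]]


letters = "IVA"

# Toy symmetric scores, given for one ordering of each pair (illustrative values).
upper = {("I", "I"): 4, ("V", "V"): 4, ("A", "A"): 4,
         ("I", "V"): 3, ("V", "A"): 1, ("I", "A"): -1}

# Fill in the full symmetric matrix.
s = {}
for (a, b), value in upper.items():
    s[a, b] = value
    s[b, a] = value

failures = failing_triples(letters, s)
print(failures)            # ordered triples; each independent triple appears twice
print(len(failures) // 2)  # number of independent triples for a symmetric matrix
```

Because the matrix is symmetric, a failure for (a, b, c) is always accompanied by a
failure for (c, b, a), which is why the count of independent triples is half the
number of ordered ones.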

This observation leads to the conclusion that the 'near-metric' of Linial, Linial,
Tishby and Yona [126], derived from local similarities based on the BLOSUM62 matrix
and affine gap penalties by the formula
$d(x, y) = \mathcal{H}(x, x) + \mathcal{H}(y, y) - 2 \mathcal{H}(x, y)$, is in fact a
true metric and that the rare instances where the triangle inequality was observed to
fail were solely due to non-standard letters such as B, Z and X which represent sets of
amino acids (for example, X stands for any amino acid) and whose similarity scores were
derived by averaging over all represented letters.

3.7 Profiles 

3.7.1 Position specific score matrices 

From a biological point of view, profiles are generalised sequences. They were
originally introduced by Gribskov, McLachlan and Eisenberg [78] in order to
model the situations where similarity measures based on score matrices do not
retrieve all biologically relevant neighbours. As mentioned in Chapter 1, the
function of a protein depends on its structure, which in turn depends on its amino
acid sequence. The structure space is smaller than the sequence space [142, 95] and
hence similar structures can arise from quite distantly related (in the evolutionary
sense) sequences that do not share sufficiently high similarity to be detected using
score matrix based methods. However, even significantly different structurally
related sequences often contain a few sites, usually associated with a particular
biological role, that are strongly conserved across species. Hence the idea of using
position specific scores to model protein families and find their new members.

In the sense of Gribskov, McLachlan and Eisenberg, the term profile can be
used interchangeably with the term Position Specific Score Matrix, or PSSM. A PSSM
is an $n$-by-$|\Sigma|$ matrix where $\Sigma$ is an appropriate finite alphabet (most
often the set of 20 standard amino acids used in proteins; in fact we will always
assume this is the case and use 'amino acid' and 'letter' interchangeably). For any
PSSM $M$, an entry $M_{i,a}$, where $1 \le i \le n$ and $a \in \Sigma$, gives the score
of the letter $a$ in position $i$. Obviously, entries of a PSSM can come from
similarity score matrices, that is, from similarities on $\Sigma$. Let
$x = x_1 x_2 \ldots x_n$ and let $s : \Sigma \times \Sigma \to \mathbb{R}$ be
a similarity score function (or matrix, since $\Sigma$ is assumed finite). Then, one
can produce a PSSM by setting

\[ M_{i,a} = s(x_i, a). \]

Of course, in this case, the PSSM is not really 'position specific': the scores for
the same amino acid at different positions are the same whenever the letters of $x$ at
those positions coincide. To summarise, PSSMs are generalisations of similarity score
matrices.

The score of a sequence with respect to a PSSM is calculated very similarly to
the usual similarity scores. Let $x = x_1 x_2 \ldots x_m$ and let $M$ be an
$n$-by-$|\Sigma|$ PSSM. If $m = n$, one can write the score $M(x)$ as

\[ M(x) = \sum_{i=1}^{m} M_{i,x_i}, \]

that is, as an $\ell_1$-type sum. On the other hand, if $m \ne n$ and gapped local
scores are desired, a modified Smith-Waterman algorithm can be used.

Let $g, h$ be positive gap penalty functions $\mathbb{N}_+ \to \mathbb{R}_+$ and let
$H$ be an $(n+1)$-by-$(m+1)$ matrix indexed from 0. Set
$H_{0,0} = H_{i,0} = H_{0,j} = 0$ and for all $i = 1, 2, \ldots, n$ and
$j = 1, 2, \ldots, m$

\[ H_{i,j} = \max\Big\{ 0,\ H_{i-1,j-1} + M_{i,x_j},\ \max_{1 \le k \le i} \{ H_{i-k,j} - h(k) \},\ \max_{1 \le k \le j} \{ H_{i,j-k} - g(k) \} \Big\}. \]

The local similarity score of $x$ with respect to the PSSM $M$, denoted
$\mathcal{H}_M(x)$, is given by $\mathcal{H}_M(x) = \max_{i,j} H_{i,j}$. Global
similarities can be produced using an appropriate modification of the
Needleman-Wunsch algorithm.
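
The two scoring schemes just described can be sketched as follows. This is a minimal
illustration rather than a reference implementation: the PSSM is represented as a list
of per-position dictionaries, the function names are hypothetical, and the toy score
matrix and gap penalty in the usage lines are assumptions.

```python
def pssm_from_sequence(x, s):
    """PSSM with rows M[i][a] = s(x_i, a), built from a score matrix s."""
    alphabet = sorted({a for a, _ in s})
    return [{a: s[xi, a] for a in alphabet} for xi in x]


def ungapped_score(M, x):
    """Score M(x) as the sum of M[i][x_i]; requires equal lengths."""
    if len(M) != len(x):
        raise ValueError("sequence and PSSM must have the same length")
    return sum(row[a] for row, a in zip(M, x))


def local_score(M, x, g, h):
    """Gapped local score of x against the PSSM M (modified Smith-Waterman).

    g and h are gap penalty functions of the gap length k >= 1.
    """
    n, m = len(M), len(x)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [0, H[i - 1][j - 1] + M[i - 1][x[j - 1]]]
            candidates += [H[i - k][j] - h(k) for k in range(1, i + 1)]
            candidates += [H[i][j - k] - g(k) for k in range(1, j + 1)]
            H[i][j] = max(candidates)
            best = max(best, H[i][j])
    return best


# Toy usage: two-letter alphabet, match 2, mismatch -1, affine gap penalty 2 + k.
s = {(a, b): (2 if a == b else -1) for a in "ab" for b in "ab"}
gap = lambda k: 2 + k
M = pssm_from_sequence("aba", s)
print(ungapped_score(M, "aba"), local_score(M, "aba", gap, gap))
```

With general gap functions the recurrence costs O(nm(n + m)) time; for affine
penalties the usual two-state optimisation would reduce this to O(nm), but that
refinement is omitted to keep the sketch close to the formula.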



3.7.2 Profiles as distributions 

While we have seen that profiles may come from similarity score matrices, they
are usually produced from collections of related sequences, that is, (putative)
members of a protein family. Given a (finite) set of sequences¹ $U = \{u^j\}_j$, we
first produce a multiple alignment of all of them. For the sake of simplicity, assume
that the multiple alignment is ungapped, that is, only letters are present², and
that all sequences have the same length. Clearly, the relative frequencies of letters
at each position $i$ define a probability distribution $q_i$, where $q_i(a)$ is the
probability of an amino acid $a$ occurring at the position $i$. Given a background
amino acid distribution $p$, where $p(a)$ is the overall relative frequency of $a$, we
can define a PSSM as a matrix of log-odds ratios

\[ M_{i,a} = \log \frac{q_i(a)}{p(a)}, \tag{3.12} \]

exactly mirroring the definition of the BLOSUM matrices in Subsection 3.6.2.
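
A short sketch of this construction is given below. One detail is an addition of ours
and not part of the description above: a pseudocount is added to every letter count so
that letters absent from a column do not produce log 0. The function name, the
pseudocount value and the toy block are all hypothetical.

```python
import math
from collections import Counter


def profile_pssm(block, background, pseudocount=1.0):
    """Log-odds PSSM from an ungapped block of equal-length sequences.

    block       -- list of sequences of the same length
    background  -- dict mapping each letter a to its background frequency p(a)
    pseudocount -- assumed smoothing constant added to every count (avoids log 0)
    """
    alphabet = sorted(background)
    length = len(block[0])
    if any(len(u) != length for u in block):
        raise ValueError("all sequences in the block must have the same length")
    total = len(block) + pseudocount * len(alphabet)
    pssm = []
    for i in range(length):
        counts = Counter(u[i] for u in block)
        # Smoothed relative frequencies q_i(a) of letters in column i.
        q_i = {a: (counts[a] + pseudocount) / total for a in alphabet}
        pssm.append({a: math.log(q_i[a] / background[a]) for a in alphabet})
    return pssm


block = ["ab", "ab", "aa"]
background = {"a": 0.5, "b": 0.5}
pssm = profile_pssm(block, background)
print(pssm[0]["a"] > 0, pssm[0]["b"] < 0)  # 'a' is over-represented in column 0
```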

This leads to an alternative definition of profiles, used for example by Yona and
Levitt [218]. From this point of view, a profile is a sequence of probability
distributions on $\Sigma$, that is, a member of the free semigroup generated by
$\mathcal{M}(\Sigma)$, the set of all probability distributions over $\Sigma$. The two
definitions are in fact closely related since, given a background distribution $p$,
every sequence of distributions can be converted into a PSSM using Equation (3.12),
while it is also clear [5, 105] that scores at each position can be, after scaling,
converted to probabilities. Note that the scaling factors need not be the same for each
position and thus each scaling factor can be treated as a 'weight' for the particular
position. The log-odds scores and the scaling factors have information-theoretic
interpretations [5, 105, 52] that we will not discuss here.

¹The index is in superscript rather than subscript in order to distinguish a sequence
entry in $U$ ($u^i$) and a residue of $u$ at position $i$ ($u_i$).

²Profile hidden Markov models [53] further generalise the profiles by modelling gaps as
well as 'matches'.

The definition of profiles as members of $\mathcal{M}(\Sigma)^*$ opens interesting
possibilities for introducing quasi-metrics for profile-profile comparison. Suppose we
have a quasi-metric and a positive function on $\mathcal{M}(\Sigma)$. Then, we can
extend them to obtain a weighted quasi-metric on $\mathcal{M}(\Sigma)^*$ using dynamic
programming and Theorem 3.5.5. The similarity scores and distances thus obtained would
have a similar interpretation to the scores obtained from score matrices. Yona and
Levitt [218] produced a profile-profile comparison tool by using the same principles,
that is, by extending a similarity score function on $\mathcal{M}(\Sigma)$ to
$\mathcal{M}(\Sigma)^*$ using dynamic programming. However, it is unclear from their
presentation if their score function can induce a quasi-metric.
Chapter 4

Quasi-metric Spaces with Measure 



The main object of study of this chapter is the pq-space, the quasi-metric space with
Borel probability measure (or probability quasi-metric space), which we introduce
here for the first time. As most of the theory of measure concentration was developed
within the framework of a metric space with measure, we will throughout
this chapter state the definitions and results for the metric case first and then give
the corresponding statements for the quasi-metric case. The proofs will be given
only for the quasi-metric case (as they include the metric case) and where they
are not available elsewhere. For an extensive review of the theory for the metric
case the reader is referred to the excellent monograph by Ledoux [121], a chapter
of Gromov's well-known book [79], as well as the book by Milman and
Schechtman [138], which mainly concentrates on normed spaces.

We aim to explore the phenomenon of concentration of measure in high di- 
mensional structures in the case where the underlying structure is a quasi-metric 
space with measure. Many results and proofs can be transferred almost verbatim 
from the metric case. However, we also develop new results which have no metric 
analogues. 

4.1 Basic Measure Theory 

Let $\Omega$ be a set. A collection $\mathcal{A}$ of subsets of $\Omega$ is called a
$\sigma$-algebra if it satisfies

(i) $\Omega \in \mathcal{A}$,

(ii) if $A \in \mathcal{A}$ then $\Omega \setminus A \in \mathcal{A}$,

(iii) if $A = \bigcup_{k \ge 1} A_k$ with $A_k \in \mathcal{A}$ for all $k$, then
$A \in \mathcal{A}$.

Let $\mathcal{S}$ be a collection of subsets of $\Omega$. The $\sigma$-algebra
generated by $\mathcal{S}$, denoted $\sigma(\mathcal{S})$, is the smallest
$\sigma$-algebra containing $\mathcal{S}$ (one $\sigma$-algebra containing
$\mathcal{S}$ always exists: the power set $\mathcal{P}(\Omega)$).

A function $\mu : \mathcal{A} \to [0, \infty]$ such that $\mu(\emptyset) = 0$ is a
measure on $\mathcal{A}$ if it is countably additive, that is, if

\[ \mu\Big( \bigcup_{k \ge 1} A_k \Big) = \sum_{k \ge 1} \mu(A_k) \]

for all pairwise disjoint sets $A_k \in \mathcal{A}$. A measure space is a triple
$(\Omega, \mathcal{A}, \mu)$ where $\Omega$ is a set, $\mathcal{A}$ is a
$\sigma$-algebra and $\mu$ is a measure. A probability space is a measure
space with total measure $\mu(\Omega) = 1$.

Let $(\Omega, \mathcal{A}, \mu)$ be a measure space. The measure $\mu$ is called
$\sigma$-finite if there exists a countable collection of sets
$\{\Omega_i\}_{i=1}^{\infty} \subseteq \mathcal{A}$ such that
$\Omega = \bigcup_{i=1}^{\infty} \Omega_i$ and $\mu(\Omega_i) < \infty$ for each $i$.

The Borel $\sigma$-algebra on a topological space $(X, \mathcal{T})$ is the smallest
$\sigma$-algebra containing $\mathcal{T}$. The existence and uniqueness of the Borel
algebra is shown by noting that the intersection of all $\sigma$-algebras containing
$\mathcal{T}$ is itself a $\sigma$-algebra, so this intersection is the Borel algebra.
The elements of the Borel $\sigma$-algebra are called Borel sets, while the measures on
Borel $\sigma$-algebras are called Borel measures.

The Borel $\sigma$-algebra may alternatively and equivalently be defined as the
smallest $\sigma$-algebra which contains all the closed subsets of $X$. A subset of $X$
is a Borel set if and only if it can be obtained from open (or closed) sets by using
the set operations union, intersection and complement in countable number, more exactly
via transfinite recursion over countable ordinals.
4.2 pq-spaces

Definition 4.2.1. A topological space $(X, \mathcal{T})$ is called Polish if it is
separable and metrisable by means of a complete metric. ▲

We recall the definition of a metric space with measure, as defined in [81].

Definition 4.2.2 ([81], [80]). An mm-space is a triple $(X, d, \mu)$ where $(X, d)$ is
a Polish metric space and $\mu$ a $\sigma$-finite Borel measure on $X$.

An mm-space where $\mu(X) = 1$ is called a pm-space. ▲

We shall mostly be concerned with mm-spaces equipped with finite measures 
and will assume wherever possible that the measure has been normalised so that 
they become pm-spaces. 

In order to define an analogue for a quasi-metric space $(X, d)$ we observe that
it is not sufficient to use the Borel $\sigma$-algebra generated by $\mathcal{T}(d)$,
since we want to have the open and closed sets with respect to both $\mathcal{T}(d)$
and $\mathcal{T}(d^*)$ measurable. Hence, we use the Borel $\sigma$-algebra generated
by $\mathcal{T}(d) \cup \mathcal{T}(d^*)$. It is easy to see that this structure is
equivalent to the Borel $\sigma$-algebra generated by $\mathcal{T}(d^s)$, the
topology of the associated metric, by observing that
$\mathfrak{B}_\varepsilon(x) = \mathfrak{B}^L_\varepsilon(x) \cap \mathfrak{B}^R_\varepsilon(x)$
(see the corresponding remark in Chapter 2).

In order to make our definition fully analogous to the definition of the mm-space,
we additionally require that our quasi-metric be bicomplete, that is, that its
associated metric be complete.

Definition 4.2.3. Let $(X, d)$ be a bicomplete separable quasi-metric space and $\mu$ a
$\sigma$-finite measure over $\mathcal{B}$, the Borel $\sigma$-algebra of measurable
sets generated by $\mathcal{T}(d^s)$, where $d^s$ is the metric associated to $d$. We
call the triple $(X, d, \mu)$ an mq-space. If in addition $\mu(X) = 1$, we call such
a triple a pq-space.

Furthermore, we call the mq-space $(X, d^*, \mu)$ the conjugate or dual mq-space
to $(X, d, \mu)$ and the mm-space $(X, d^s, \mu)$ the associated mm-space to
$(X, d, \mu)$. ▲

Henceforth, we shall always use the symbol $\mathcal{B}$ in the context of mq-spaces to
denote the underlying Borel $\sigma$-algebra.

Remark 4.2.4. The fact that $(X, d^s, \mu)$, the associated mm-space to
$(X, d, \mu)$, is indeed an mm-space is a direct consequence of having the Borel
$\sigma$-algebra of measurable sets generated by $\mathcal{T}(d^s)$.

In this work we shall only consider pq-spaces, that is, the quasi-metric spaces
with finite measure. The definition of an mq-space was introduced in order to
correspond to the definition of an mm-space as given by Gromov [79, 80].

In order to illustrate one possible way of interaction between a quasi-metric 
and measure we give another example of Lipschitz functions. 

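Lemma 4.2.5. Let $(X, d, \mu)$ be a pq-space and $0 < p < 1$. The function
$\rho_p : X \to \mathbb{R}$, where
$\rho_p(x) = \inf\{ r > 0 : \mu(\mathfrak{B}^L_r(x)) \ge p \}$, is left 1-Lipschitz,
while $\rho_p^* : X \to \mathbb{R}$, where
$\rho_p^*(x) := \inf\{ r > 0 : \mu(\mathfrak{B}^R_r(x)) \ge p \}$, is right 1-Lipschitz.

Figure 4.1: The function $\rho_p$.

Proof. Let $x, y \in X$. By the triangle inequality,
$\mathfrak{B}^L_{d(x,y)+r}(x) \supseteq \mathfrak{B}^L_r(y)$ for every $r > 0$, so that
$\mu(\mathfrak{B}^L_r(y)) \ge p$ implies $\mu(\mathfrak{B}^L_{d(x,y)+r}(x)) \ge p$.
Taking the infimum over all such $r$, it follows that
$\rho_p(x) \le d(x, y) + \rho_p(y)$ and therefore $\rho_p(x) - \rho_p(y) \le d(x, y)$.
The second statement follows in a similar manner. □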
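
On a finite pq-space the function of Lemma 4.2.5 can be computed directly, which gives a
quick way to see the left 1-Lipschitz property at work. The three-point space below,
its distances and its point masses are assumptions chosen for illustration, and the
left ball is taken to be the open ball {z : d(x, z) < r}.

```python
from fractions import Fraction
from itertools import product

POINTS = "abc"

# A quasi-metric on three points (illustrative values); d(x, x) = 0 implicitly.
D = {("a", "b"): 1, ("b", "c"): 1, ("c", "b"): 2,
     ("b", "a"): 2, ("a", "c"): 2, ("c", "a"): 4}

# Assumed point masses of a probability measure.
MU = {"a": Fraction(1, 4), "b": Fraction(1, 2), "c": Fraction(1, 4)}


def d(x, y):
    return 0 if x == y else D[x, y]


def rho(x, p):
    """inf{r > 0 : mu(left open ball of radius r at x) >= p} on a finite space.

    For radii just above a distance value t the open ball equals
    {z : d(x, z) <= t}, so the infimum is the smallest such t whose
    closed ball already carries mass at least p.
    """
    for t in sorted({d(x, z) for z in POINTS}):
        if sum(MU[z] for z in POINTS if d(x, z) <= t) >= p:
            return t


def is_left_1_lipschitz(func):
    """Check func(x) - func(y) <= d(x, y) for all pairs of points."""
    return all(func(x) - func(y) <= d(x, y)
               for x, y in product(POINTS, repeat=2))


half = Fraction(1, 2)
print([rho(x, half) for x in POINTS])
print(is_left_1_lipschitz(lambda x: rho(x, half)))
```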
Recall the definition of the concentration function for an mm-space.

Definition 4.3.1. Let $(X, d, \mu)$ be an mm-space and $\mathcal{B}$ the Borel
$\sigma$-algebra of $\mu$-measurable sets. The concentration function
$\alpha_{(X,d,\mu)}$, also denoted $\alpha$, is a function
$\mathbb{R}_+ \to [0, \frac{1}{2}]$ such that $\alpha_{(X,d,\mu)}(0) = \frac{1}{2}$ and
for all $\varepsilon > 0$

\[ \alpha_{(X,d,\mu)}(\varepsilon) = \sup\left\{ 1 - \mu(A_\varepsilon) : A \in \mathcal{B},\ \mu(A) \ge \tfrac{1}{2} \right\}. \]

▲

The concentration function measures the maximum size of a complement
('cap') of a neighbourhood of a Borel set of measure not less than $\frac{1}{2}$. In a
sense to be made more precise later, a space is 'concentrated' if its concentration
function is extremely small for small $\varepsilon$.

As before with asymmetric structures, we introduce two concentration functions
on a pq-space, left and right.

Figure 4.2: Left concentration function $\alpha^L$.

Definition 4.3.2. Let $(X, d, \mu)$ be a pq-space and $\mathcal{B}$ the Borel
$\sigma$-algebra of $\mu$-measurable sets. The left concentration function
$\alpha^L_{(X,d,\mu)}$, also denoted $\alpha^L$, is a map
$\mathbb{R}_+ \to [0, \frac{1}{2}]$ such that $\alpha^L_{(X,d,\mu)}(0) = \frac{1}{2}$
and for all $\varepsilon > 0$

\[ \alpha^L_{(X,d,\mu)}(\varepsilon) = \sup\left\{ 1 - \mu(A^L_\varepsilon) : A \in \mathcal{B},\ \mu(A) \ge \tfrac{1}{2} \right\}. \]

Similarly, the right concentration function $\alpha^R_{(X,d,\mu)}$, also denoted
$\alpha^R$, is a map $\mathbb{R}_+ \to [0, \frac{1}{2}]$ such that
$\alpha^R_{(X,d,\mu)}(0) = \frac{1}{2}$ and for all $\varepsilon > 0$

\[ \alpha^R_{(X,d,\mu)}(\varepsilon) = \sup\left\{ 1 - \mu(A^R_\varepsilon) : A \in \mathcal{B},\ \mu(A) \ge \tfrac{1}{2} \right\}. \]

▲

Remark 4.3.3. For an mm-space $(X, d, \mu)$, $\alpha^L$ and $\alpha^R$ are equal and
they coincide with the usual concentration function $\alpha_{(X,d,\mu)}$. It is also
easy to observe that for a pq-space $(X, d, \mu)$,

\[ \alpha^L_{(X,d,\mu)} = \alpha^R_{(X,d^*,\mu)}. \]

The concentration functions $\alpha^L$ and $\alpha^R$ respectively measure the maximum
size of the complement to any left and right neighbourhood of a Borel set of
measure not less than $\frac{1}{2}$ (Fig. 4.2).

Lemma 4.3.4. For any pq-space $(X, d, \mu)$, the concentration functions
$\alpha^L_{(X,d,\mu)}$ and $\alpha^R_{(X,d,\mu)}$ are decreasing and converge to $0$ as
$\varepsilon \to \infty$. Furthermore, if $\mathrm{diam}(X)$ is finite, then for all
$\varepsilon > \mathrm{diam}(X)$, $\alpha^L(\varepsilon) = \alpha^R(\varepsilon) = 0$.

Figure 4.3: $\mathfrak{B}_n(x_0)$ can take as much mass as required.

Proof. We prove the statement for $\alpha^L$. It is obvious that $\alpha^L$ is bounded
below by $0$ and decreasing, since $A^L_{\varepsilon_1} \subseteq A^L_{\varepsilon_2}$
and hence $\mu(A^L_{\varepsilon_1}) \le \mu(A^L_{\varepsilon_2})$ for any Borel set
$A$ and $0 < \varepsilon_1 \le \varepsilon_2$. Thus the limit exists and is
non-negative, and we now show that
$\lim_{\varepsilon \to \infty} \alpha^L(\varepsilon) = 0$.
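Take any $0 < \delta < \frac{1}{2}$. We need to show that there is some
$\varepsilon_0 > 0$ such that for all $\varepsilon \ge \varepsilon_0$ and for any Borel
set $A$ such that $\mu(A) \ge \frac{1}{2}$ we have
$\mu(A^L_\varepsilon) > 1 - \delta$ (this is trivially true for
$\delta \ge \frac{1}{2}$). Take any $x_0 \in X$. We will show that there exists
$\varepsilon'$ such that for all $\varepsilon \ge \varepsilon'$,
$\mu(\mathfrak{B}_\varepsilon(x_0)) > 1 - \delta$. Indeed, taking the open balls
$\mathfrak{B}_n(x_0)$, $n \in \mathbb{N}_+$, with respect to the associated metric we
have

\begin{align*}
\limsup_{n \to \infty} \mu(\mathfrak{B}_n(x_0))
 &= \lim_{n \to \infty} \Big[ \mu(\mathfrak{B}_1(x_0)) + \sum_{i=1}^{n-1} \mu\big( \mathfrak{B}_{i+1}(x_0) \setminus \mathfrak{B}_i(x_0) \big) \Big] \\
 &= \mu(\mathfrak{B}_1(x_0)) + \sum_{n=1}^{\infty} \mu\big( \mathfrak{B}_{n+1}(x_0) \setminus \mathfrak{B}_n(x_0) \big) \\
 &= \mu(X) = 1
\end{align*}

by $\sigma$-additivity of the measure. Thus there is some $n_0 \in \mathbb{N}_+$ such
that for all $n \ge n_0$, $\mu(\mathfrak{B}_n(x_0)) > 1 - \delta$. Now take any Borel
set $A$ of measure at least $\frac{1}{2}$. $A$ must intersect
$\mathfrak{B}_{n_0}(x_0)$ (Figure 4.3), because if it did not, we would have
$\mu(A) \le \delta < \frac{1}{2}$, leading to a contradiction. It is now clear that for
any $\varepsilon \ge 2 n_0 \ge \mathrm{diam}(\mathfrak{B}_{n_0}(x_0))$ we have
$A^L_\varepsilon \supseteq \mathfrak{B}_{n_0}(x_0)$. Indeed, let
$a \in A \cap \mathfrak{B}_{n_0}(x_0)$ and $b \in \mathfrak{B}_{n_0}(x_0)$. Then by the
triangle inequality

\begin{align*}
d(a, b) &\le d(a, x_0) + d(x_0, b) \\
        &\le d^s(a, x_0) + d^s(x_0, b) \\
        &< n_0 + n_0 = 2 n_0.
\end{align*}

Therefore, for any $\varepsilon \ge 2 n_0$,
$\mu(A^L_\varepsilon) \ge \mu(\mathfrak{B}_{n_0}(x_0)) > 1 - \delta$ as required. It is
obvious that the same proof would work for $\alpha^R$ by substituting
$A^L_\varepsilon$ by $A^R_\varepsilon$ above.

It is also clear that if $\mathrm{diam}(X) < \infty$, then for any
$\varepsilon > \mathrm{diam}(X)$ and any non-empty $A \subseteq X$,
$X = A^L_\varepsilon = A^R_\varepsilon$ and hence
$\alpha^L(\varepsilon) = \alpha^R(\varepsilon) = 0$. □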

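The following lemmas show some relations between the various alpha functions.

Lemma 4.3.5. For any pq-space $(X, d, \mu)$, for each $\varepsilon > 0$,

\[ \max\{ \alpha^L(\varepsilon), \alpha^R(\varepsilon) \} \le \alpha(\varepsilon) \le \alpha^L(\varepsilon) + \alpha^R(\varepsilon), \]

where $\alpha$ denotes the concentration function of the associated mm-space.

Proof. Let $A \in \mathcal{B}$ be such that $\mu(A) \ge \frac{1}{2}$ and let
$\varepsilon > 0$. Using $A_\varepsilon \subseteq A^L_\varepsilon \cap A^R_\varepsilon$,

\begin{align*}
1 - \mu(A^L_\varepsilon) &\le 1 - \mu(A_\varepsilon) \le \alpha(\varepsilon) \implies \alpha^L(\varepsilon) \le \alpha(\varepsilon) \quad \text{and} \\
1 - \mu(A^R_\varepsilon) &\le 1 - \mu(A_\varepsilon) \le \alpha(\varepsilon) \implies \alpha^R(\varepsilon) \le \alpha(\varepsilon),
\end{align*}

and it follows that
$\max\{ \alpha^L(\varepsilon), \alpha^R(\varepsilon) \} \le \alpha(\varepsilon)$.

For the second inequality, use the fact that
$A_\varepsilon \supseteq A^L_\varepsilon \cap A^R_\varepsilon$, and thus
$X \setminus A_\varepsilon \subseteq (X \setminus A^L_\varepsilon) \cup (X \setminus A^R_\varepsilon)$,
implying

\[ 1 - \mu(A_\varepsilon) \le (1 - \mu(A^L_\varepsilon)) + (1 - \mu(A^R_\varepsilon)) \le \alpha^L(\varepsilon) + \alpha^R(\varepsilon). \]

□

It is easy to see that the inequalities of Lemma 4.3.5 can be strict.
Consider the following example.

Figure 4.4: Space where $\max\{ \alpha^L(\varepsilon), \alpha^R(\varepsilon) \} < \alpha(\varepsilon)$.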

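Example 4.3.6. Let $X = \{a, b, c\}$ where $d(a, b) = d(b, c) = 1$,
$d(c, b) = d(b, a) = 2$, $d(a, c) = 2$ and $d(c, a) = 4$. Set an additive measure
$\mu$ by assigning equal masses to $\{a\}$ and $\{c\}$ and the remaining mass to
$\{b\}$ (Figure 4.4). It is clear that $(X, d, \mu)$ is a pq-space and that the left
and right concentration functions take the value $\frac{1}{2}$ at $\varepsilon = 0$,
are piecewise constant with separate values for $\varepsilon$ between 0 and 1 and for
$\varepsilon$ between 1 and 2, and vanish for $\varepsilon$ beyond 2.

On the other hand, the concentration function $\alpha$ of the associated mm-space takes
the value $\frac{1}{2}$ at $\varepsilon = 0$, a single constant value for
$\varepsilon$ between 0 and 2, and vanishes for $\varepsilon$ beyond 2.

Hence for $1 < \varepsilon < 2$ we have
$\max\{ \alpha^L(\varepsilon), \alpha^R(\varepsilon) \} < \alpha(\varepsilon)$.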
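
For a finite space the three concentration functions can be evaluated by enumerating
all subsets, which makes the example easy to experiment with. In the sketch below the
point masses 1/4, 1/2, 1/4 are an assumption made for illustration, and the
neighbourhoods are taken with strict inequality: the left neighbourhood of A consists
of the points y with d(A, y) < eps, the right one of the points with d(y, A) < eps,
and the metric one uses the associated metric max{d(x, y), d(y, x)}.

```python
from fractions import Fraction
from itertools import combinations

POINTS = "abc"
D = {("a", "b"): 1, ("b", "c"): 1, ("c", "b"): 2,
     ("b", "a"): 2, ("a", "c"): 2, ("c", "a"): 4}

# Assumed point masses; a and c carry equal mass.
MU = {"a": Fraction(1, 4), "b": Fraction(1, 2), "c": Fraction(1, 4)}


def d(x, y):
    return 0 if x == y else D[x, y]


def ds(x, y):
    """Associated metric."""
    return max(d(x, y), d(y, x))


def measure(A):
    return sum(MU[x] for x in A)


def concentration(eps, dist_to_set):
    """sup of 1 - mu(A_eps) over all sets A with mu(A) >= 1/2."""
    worst = Fraction(0)
    for k in range(1, len(POINTS) + 1):
        for A in combinations(POINTS, k):
            if measure(A) >= Fraction(1, 2):
                nbhd = [y for y in POINTS if dist_to_set(A, y) < eps]
                worst = max(worst, 1 - measure(nbhd))
    return worst


def alpha_left(eps):
    return concentration(eps, lambda A, y: min(d(a, y) for a in A))


def alpha_right(eps):
    return concentration(eps, lambda A, y: min(d(y, a) for a in A))


def alpha(eps):
    return concentration(eps, lambda A, y: min(ds(a, y) for a in A))


eps = Fraction(3, 2)
print(alpha_left(eps), alpha_right(eps), alpha(eps))
```

With these assumed masses the singleton {b} is the extremal set: its left and right
neighbourhoods at eps = 3/2 each pick up one further point, while its neighbourhood in
the associated metric picks up none.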

The phenomenon of concentration of measure on high-dimensional structures
refers to the observation that in many metric spaces with measure which are,
intuitively, "high dimensional", the concentration function decreases very sharply,
that is, an $\varepsilon$-neighbourhood of any not vanishingly small set, even for very
small $\varepsilon$, covers (in terms of the probability measure) nearly the whole
space. Examples are numerous and come from many diverse branches of mathematics
[135, 81, 138, 79, 155, 185]. Here we take a "high dimensional" pq-space to be a
pq-space where both $\alpha^L$ and $\alpha^R$ decrease sharply.



4.4 Deviation Inequalities 

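Definition 4.4.1. Let $(X, \mathcal{B}, \mu)$ be a probability space and $f$ a
measurable real-valued function on $X$. A value $m_f$ is a median or Lévy mean of $f$
for $\mu$ if

\[ \mu(\{ f \le m_f \}) \ge \tfrac{1}{2} \quad \text{and} \quad \mu(\{ f \ge m_f \}) \ge \tfrac{1}{2}. \]

▲

A median need not be unique but it always exists. The following lemmas are
generalisations of the results for mm-spaces.

Lemma 4.4.2. Let $(X, d, \mu)$ be a pq-space with left and right concentration
functions $\alpha^L$ and $\alpha^R$ respectively, and $f$ a left 1-Lipschitz function
on $(X, d)$ with a median $m_f$. Then for any $\varepsilon > 0$

\begin{align*}
\mu(\{ x \in X : f(x) \le m_f - \varepsilon \}) &\le \alpha^L(\varepsilon) \quad \text{and} \\
\mu(\{ x \in X : f(x) \ge m_f + \varepsilon \}) &\le \alpha^R(\varepsilon).
\end{align*}

Conversely, if for some non-negative functions $\tilde\alpha^L$ and
$\tilde\alpha^R : \mathbb{R}_+ \to \mathbb{R}$,

\begin{align*}
\mu(\{ x \in X : f(x) \le m_f - \varepsilon \}) &\le \tilde\alpha^L(\varepsilon) \quad \text{and} \\
\mu(\{ x \in X : f(x) \ge m_f + \varepsilon \}) &\le \tilde\alpha^R(\varepsilon)
\end{align*}

for every left 1-Lipschitz function $f : X \to \mathbb{R}$ with median $m_f$ and every
$\varepsilon > 0$, then $\alpha^L \le \tilde\alpha^L$ and
$\alpha^R \le \tilde\alpha^R$.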

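Proof. Set $A = \{ x \in X : f(x) \ge m_f \}$. Take any $y \in X$ such that
$f(y) \le m_f - \varepsilon$. Then, for any $x \in A$,
$d(x, y) \ge f(x) - f(y) \ge \varepsilon$ and hence $d(A, y) \ge \varepsilon$, implying
$y \in X \setminus A^L_\varepsilon$. Therefore,
$\mu(\{ x \in X : f(x) \le m_f - \varepsilon \}) \le 1 - \mu(A^L_\varepsilon) \le \alpha^L(\varepsilon)$.

Now set $B = \{ x \in X : f(x) \le m_f \}$. Take any $y \in X$ such that
$f(y) \ge m_f + \varepsilon$. Then, for any $x \in B$,
$d(y, x) \ge f(y) - f(x) \ge \varepsilon$ and hence $d(y, B) \ge \varepsilon$,
implying $y \in X \setminus B^R_\varepsilon$. Thus,
$\mu(\{ x \in X : f(x) \ge m_f + \varepsilon \}) \le 1 - \mu(B^R_\varepsilon) \le \alpha^R(\varepsilon)$.

The converse is equivalent to finding, for each Borel set $A \subseteq X$ such that
$\mu(A) \ge \frac{1}{2}$, left 1-Lipschitz functions $f$ and $g : X \to \mathbb{R}$
with medians $m_f$ and $m_g$ respectively, such that
$1 - \mu(A^L_\varepsilon) \le \mu(\{ x \in X : f(x) \le m_f - \varepsilon \})$ and
$1 - \mu(A^R_\varepsilon) \le \mu(\{ x \in X : g(x) \ge m_g + \varepsilon \})$.

Let $A \subseteq X$ be a set such that $\mu(A) \ge \frac{1}{2}$ and set, for each
$y \in X$, $f(y) = -d(A, y)$ and $g(y) = d(y, A)$. It is easy to see that both $f$ and
$g$ are left 1-Lipschitz and that $m_f = m_g = 0$. If
$y \in X \setminus A^L_\varepsilon$, we have $d(A, y) \ge \varepsilon$ and thus
$f(y) \le -\varepsilon$. Similarly, if $y \in X \setminus A^R_\varepsilon$, we have
$d(y, A) \ge \varepsilon$, implying $g(y) \ge \varepsilon$, and the result follows. □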

Hence, we can state the alternative definitions of $\alpha^L$ and $\alpha^R$:

$\alpha^L(\varepsilon) = \sup\{\mu(\{x \in X : f(x) \le m_f - \varepsilon\}) : f \text{ is left 1-Lipschitz}\}$

and

$\alpha^R(\varepsilon) = \sup\{\mu(\{x \in X : f(x) \ge m_f + \varepsilon\}) : f \text{ is left 1-Lipschitz}\}$.

Similar results can be easily obtained for the right 1-Lipschitz functions by
remembering that if $f$ is right 1-Lipschitz, $-f$ is left 1-Lipschitz (Lemma 2.4.3).
It is also straightforward to observe that the absolute value of the deviation of
a 1-Lipschitz function from a median thus depends on both $\alpha^L$ and $\alpha^R$.
Corollary 4.4.3. For any pq-space $(X, d, \mu)$, a left 1-Lipschitz function $f$ with a
median $m_f$ and $\varepsilon > 0$,

$\mu(\{|f - m_f| \ge \varepsilon\}) \le \alpha^L_{(X,d,\mu)}(\varepsilon) + \alpha^R_{(X,d,\mu)}(\varepsilon)$.

This result reduces to the well-known inequality $\mu(\{|f - m_f| \ge \varepsilon\}) \le 2\alpha(\varepsilon)$
when $d$ is a metric. Deviations between the values of a left 1-Lipschitz function at
any two points are also bounded by both concentration functions.
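
The corollary is stated without proof; one way to see it, sketched here under the assumption that the two bounds of Lemma 4.4.2 are the only ingredients, is a union bound over the lower and the upper deviation events:

```latex
\begin{align*}
\{|f - m_f| \ge \varepsilon\}
  &= \{f \le m_f - \varepsilon\} \cup \{f \ge m_f + \varepsilon\}, \\
\mu(\{|f - m_f| \ge \varepsilon\})
  &\le \mu(\{f \le m_f - \varepsilon\}) + \mu(\{f \ge m_f + \varepsilon\}) \\
  &\le \alpha^L(\varepsilon) + \alpha^R(\varepsilon).
\end{align*}
```

When $d$ is a metric the left and right concentration functions coincide with $\alpha$, which gives the familiar factor of 2.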

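Lemma 4.4.4. Let $(X, d, \mu)$ be a pq-space and $f : X \to \mathbb{R}$ a left (or right)
1-Lipschitz function. Then

$(\mu \otimes \mu)(\{(x, y) \in X \times X : f(x) - f(y) \ge \varepsilon\}) \le \alpha^L(\frac{\varepsilon}{2}) + \alpha^R(\frac{\varepsilon}{2})$.

Proof.

$(\mu \otimes \mu)(\{(x, y) \in X \times X : f(x) - f(y) \ge \varepsilon\})$
$\le (\mu \otimes \mu)(\{(x, y) \in X \times X : f(x) - m_f \ge \frac{\varepsilon}{2}\})$
$\quad + (\mu \otimes \mu)(\{(x, y) \in X \times X : m_f - f(y) \ge \frac{\varepsilon}{2}\})$
$= \mu(\{x \in X : f(x) \ge m_f + \frac{\varepsilon}{2}\}) + \mu(\{x \in X : f(x) \le m_f - \frac{\varepsilon}{2}\})$
$\le \alpha^R(\frac{\varepsilon}{2}) + \alpha^L(\frac{\varepsilon}{2})$.

□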



4.5 Lévy Families



Definition 4.5.1. A sequence of pq-spaces $\{(X_n, d_n, \mu_n)\}_{n=1}^{\infty}$ is called a left Lévy
family if the left concentration functions $\alpha^L_{(X_n, d_n, \mu_n)}$ converge to 0 pointwise, that
is

$\forall \varepsilon > 0, \quad \alpha^L_{(X_n, d_n, \mu_n)}(\varepsilon) \to 0$ as $n \to \infty$.

Similarly, a sequence of pq-spaces $\{(X_n, d_n, \mu_n)\}_{n=1}^{\infty}$ is called a right Lévy
family if the right concentration functions $\alpha^R_{(X_n, d_n, \mu_n)}$ converge to 0 pointwise, that is

$\forall \varepsilon > 0, \quad \alpha^R_{(X_n, d_n, \mu_n)}(\varepsilon) \to 0$ as $n \to \infty$.

A sequence which is both a left and a right Lévy family will be called a Lévy
family. Furthermore, if for some constants $C_1, C_2 > 0$ one has
$\alpha_n(\varepsilon) \le C_1 \exp(-C_2 \varepsilon^2 n)$, such a sequence is called a normal Lévy family. ▲

It is a straightforward corollary of Lemma 4.3.5 that a sequence of pq-spaces
$\{(X_n, d_n, \mu_n)\}_{n=1}^{\infty}$ is a Lévy family if and only if the sequence of associated
mm-spaces $\{(X_n, d_n^s, \mu_n)\}_{n=1}^{\infty}$ is a Lévy family.

To illustrate the existence of sequences of pq-spaces which are right but not left
Lévy families, consider the following example.

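Example 4.5.2. Let $X = \{a, b\}$ with $\mu_n(\{a\}) = \frac{2}{3}$ and $\mu_n(\{b\}) = \frac{1}{3}$. Set
$d_n(a, b) = 1$ and $d_n(b, a) = \frac{1}{n}$ where $n \in \mathbb{N}_+$ (Fig. 4.5).

Figure 4.5: Spaces $X_n$ where $\alpha^R_n \to 0$ as $n \to \infty$ but $\alpha^L_n$ does not.

It is clear that

$\alpha^L_n(\varepsilon) = \begin{cases} \frac{1}{2}, & \text{if } \varepsilon = 0 \\ \frac{1}{3}, & \text{if } 0 < \varepsilon \le 1 \\ 0, & \text{if } \varepsilon > 1, \end{cases}$
and
$\alpha^R_n(\varepsilon) = \begin{cases} \frac{1}{2}, & \text{if } \varepsilon = 0 \\ \frac{1}{3}, & \text{if } 0 < \varepsilon \le \frac{1}{n} \\ 0, & \text{if } \varepsilon > \frac{1}{n}. \end{cases}$

Hence, $\alpha^R_n$ converges to 0 pointwise while $\alpha^L_n$ does not. In this case $\alpha_n = \alpha^L_n$.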
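
For small finite spaces, left and right concentration functions can be checked by exhaustive enumeration. The sketch below is illustrative only: it assumes the convention that the left $\varepsilon$-neighbourhood of $A$ is $\{y : d(A, y) < \varepsilon\}$, the right one is $\{y : d(y, A) < \varepsilon\}$, and $\alpha(0) = \frac{1}{2}$; the helper names and the point weights are hypothetical choices made for the illustration.

```python
from fractions import Fraction
from itertools import combinations

def concentration(points, d, mu, eps, side):
    """Brute-force left/right concentration function of a finite pq-space.

    Assumed convention: the left eps-neighbourhood of A is {y : d(A, y) < eps},
    the right one is {y : d(y, A) < eps}, and alpha(0) = 1/2.
    """
    if eps == 0:
        return Fraction(1, 2)
    worst = Fraction(0)
    # Enumerate every subset A with mu(A) >= 1/2 (exponential in len(points)).
    for k in range(1, len(points) + 1):
        for A in combinations(points, k):
            if sum(mu[a] for a in A) < Fraction(1, 2):
                continue
            if side == "left":
                nbhd = [y for y in points if min(d(a, y) for a in A) < eps]
            else:
                nbhd = [y for y in points if min(d(y, a) for a in A) < eps]
            worst = max(worst, 1 - sum(mu[y] for y in nbhd))
    return worst

def two_point_space(n):
    # Hypothetical weights chosen for illustration.
    mu = {"a": Fraction(2, 3), "b": Fraction(1, 3)}
    dist = {("a", "b"): Fraction(1), ("b", "a"): Fraction(1, n)}
    d = lambda x, y: Fraction(0) if x == y else dist[(x, y)]
    return ["a", "b"], d, mu

for n in (1, 10, 100):
    pts, d, mu = two_point_space(n)
    eps = Fraction(1, 2)
    print(n, concentration(pts, d, mu, eps, "left"),
          concentration(pts, d, mu, eps, "right"))
```

With these weights the left value stays fixed as $n$ grows while the right value drops to zero once $\frac{1}{n} < \varepsilon$, which is the right-but-not-left behaviour the example describes. The enumeration over subsets is exponential in the number of points, so this is only practical for toy spaces.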



Examples of Lévy families of mm-spaces abound in many diverse areas of
mathematics. We only mention a few.

Example 4.5.3 (Maurey [135]). The sequence $\{(S_n, d_n, \mu_n)\}_{n=1}^{\infty}$ where $S_n$ is the
group of permutations of rank $n$, $d_n$ is the normalised Hamming distance given by

$d_n(\sigma, \tau) = \frac{1}{n} |\{i : \sigma(i) \ne \tau(i)\}|$,

and $\mu_n$ is the normalised counting measure, forms a normal Lévy family with the
concentration functions satisfying

$\alpha_{(S_n, d_n, \mu_n)}(\varepsilon) \le 2\exp(-\varepsilon^2 n/64)$.

Example 4.5.4 (Lévy [123]). The family of spheres $\mathbb{S}^n \subset \mathbb{R}^{n+1}$ with the geodesic
metric and the rotation invariant measure forms a normal Lévy family where



Example 4.5.5 (Gromov and Milman [81]). The special orthogonal group $SO(n)$
consists of all orthogonal $n \times n$ matrices having the determinant 1. The family of
these groups with the geodesic metric and the normalised Haar measure forms a
normal Lévy family where



The Hamming cube, discussed in Subsection 4.7.1, provides another example
(Proposition 4.7.4).

4.6 High dimensional pq-spaces are very close to
mm-spaces

Most of the above concepts and results are generalisations of mm-space results.
However, we now develop some results which are trivial in the case of mm-spaces.
The main result is that, if both left and right concentration functions drop off
sharply, the asymmetry at each pair of points is also very small and the quasi-metric
is very close to a metric.
Definition 4.6.1. For a quasi-metric space $(X, d)$, the asymmetry is a map
$\Gamma : X \times X \to \mathbb{R}$ defined by $\Gamma(x, y) = |d(x, y) - d(y, x)|$. ▲

Obviously, $\Gamma = 0$ on a metric space. However, $\Gamma$ is also close to 0 for high
dimensional spaces, that is, those pq-spaces for which both $\alpha^L$ and $\alpha^R$ decrease
sharply near zero.

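Theorem 4.6.2. Let $(X, d, \mu)$ be a pq-space. For any $\varepsilon > 0$,

$(\mu \otimes \mu)(\{(x, y) \in X \times X : \Gamma(x, y) \ge \varepsilon\}) \le \alpha^L(\frac{\varepsilon}{2}) + \alpha^R(\frac{\varepsilon}{2})$.

Proof. Fix $a \in X$ and set for each $x \in X$, $\gamma_a(x) = d(x, a) - d(a, x)$. It is
clear that $\gamma_a$ is a sum of two left 1-Lipschitz maps and therefore left 2-Lipschitz.
Furthermore, zero is its median since there is a measure-preserving bijection
$(x, y) \mapsto (y, x)$ which maps the set $\{(x, y) \in X \times X : d(x, y) > d(y, x)\}$
onto the set $\{(x, y) \in X \times X : d(x, y) < d(y, x)\}$. By Lemma 4.4.2,
$\mu(\{x \in X : |\gamma_a(x)| \ge \varepsilon\}) \le \alpha^L(\frac{\varepsilon}{2}) + \alpha^R(\frac{\varepsilon}{2})$. Now, using Fubini's theorem,

$(\mu \otimes \mu)(\{(x, y) \in X \times X : |d(x, y) - d(y, x)| \ge \varepsilon\})$
$= \int_{x \in X} \int_{y \in X} \mathbf{1}_{\{|\gamma_x(y)| \ge \varepsilon\}} \, d\mu(y) \, d\mu(x)$
$\le \alpha^L(\frac{\varepsilon}{2}) + \alpha^R(\frac{\varepsilon}{2})$.

□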

Thus, any pq-space where both $\alpha^L$ and $\alpha^R$ (equivalently, by Lemma 4.3.5,
$\alpha$) sharply decrease is, apart from a set of very small size, very close to an
mm-space.

If we restrict ourselves to longer ranges, that is, bound the distances $d(x, y)$
from below, then more precise bounds for the difference $d(x, y) - d(y, x)$ can be
obtained.

Corollary 4.6.3. Let $(X, d, \mu)$ be a pq-space and $0 < \varepsilon < \delta < \infty$. Then, for any
pair $(x, y) \in X \times X$ such that $\delta \le d(x, y)$, apart from a set of $(\mu \otimes \mu)$-measure
at most $\alpha^L(\frac{\varepsilon}{2}) + \alpha^R(\frac{\varepsilon}{2})$, the values $d(x, y)$ and $d(y, x)$ differ by a
factor of less than $1 + \varepsilon/\delta$. More precisely,

$\left(1 - \frac{\varepsilon}{\delta}\right) d(x, y) < d(y, x) < \left(1 + \frac{\varepsilon}{\delta}\right) d(x, y)$.

Proof. By the previous theorem, for any $\varepsilon > 0$, apart from a set of measure at most
$\alpha^L(\frac{\varepsilon}{2}) + \alpha^R(\frac{\varepsilon}{2})$, the values of $d(x, y)$ and $d(y, x)$ differ by less than
$\varepsilon$. The result now follows by rearrangement of the inequality
$|d(x, y) - d(y, x)| < \varepsilon$. Indeed, if $d(x, y) \le d(y, x)$, we have
$d(y, x) < \left(1 + \frac{\varepsilon}{d(x, y)}\right) d(x, y) \le \left(\frac{\varepsilon}{\delta} + 1\right) d(x, y)$. If
$d(y, x) < d(x, y)$, then
$d(y, x) > \left(1 - \frac{\varepsilon}{d(x, y)}\right) d(x, y) \ge \left(1 - \frac{\varepsilon}{\delta}\right) d(x, y)$. □



4.7 Product Spaces 

4.7.1 Hamming cube 

Definition 4.7.1. Let $n \in \mathbb{N}$ and $\Sigma = \{0, 1\}$. The collection of all binary strings
of length $n$, denoted $\Sigma^n$, is called the Hamming cube. ▲

Definition 4.7.2. The Hamming distance (metric) for any two strings
$\sigma = \sigma_1 \sigma_2 \ldots \sigma_n$ and $\tau = \tau_1 \tau_2 \ldots \tau_n \in \Sigma^n$ is given by

$d_n(\sigma, \tau) = |\{i \in \mathbb{N} : \sigma_i \ne \tau_i\}|$.

The normalised Hamming distance $\rho_n$ is given by

$\rho_n(\sigma, \tau) = \frac{d_n(\sigma, \tau)}{n} = \frac{|\{i \in \mathbb{N} : \sigma_i \ne \tau_i\}|}{n}$.

Definition 4.7.3. The normalised counting measure $\mu_n$ of any subset $A$ of a
Hamming cube $\Sigma^n$ is given by

$\mu_n(A) = \frac{|A|}{2^n}$.

It is easy to see that the above definitions indeed give a set with a metric and
a measure and that $(\Sigma^n, \rho_n, \mu_n)$ is a pm-space. One may wish to consider $\Sigma^n$ as a
product space with $\rho_n$ as an $\ell_1$-type sum of discrete metrics on $\{0, 1\}$ and $\mu_n$ an
$n$-product of $\mu_1$, where $\mu_1(\{0\}) = \mu_1(\{1\}) = \frac{1}{2}$.

The following bounds to the concentration function on the Hamming cube
were stated in the book by Milman and Schechtman [138] (Section 6.2):

Proposition 4.7.4. For any Hamming cube $\Sigma^n$ with the normalised Hamming
distance $\rho_n$ and the normalised counting measure $\mu_n$, we have



Law of Large Numbers 

Hence the sequence $\{(\Sigma^n, \rho_n, \mu_n)\}_{n=1}^{\infty}$ is a normal Lévy family. An easy
consequence of Proposition 4.7.4 is the well-known law of large numbers.

Proposition 4.7.5. Let $(\varepsilon_i)_{i \le N}$ be an independent sequence of Bernoulli random
variables ($P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = \frac{1}{2}$). Then for all $t > 0$



□ 




Equivalently, if $B_N$ is the number of ones in the sequence $(\varepsilon_i)_{i \le N}$ then




□ 



Asymmetric Hamming Cube 



We will now produce a pq-space based on the Hamming cube by replacing $\rho_n$
by a quasi-metric. The simplest way is to define $d_1 : \Sigma \times \Sigma \to \mathbb{R}$ by $d_1(0, 1) = 1$
and $d_1(1, 0) = d_1(0, 0) = d_1(1, 1) = 0$ and set
$d_n(\sigma, \tau) = \frac{1}{n} \sum_{i=1}^{n} d_1(\sigma_i, \tau_i)$.
The triple $(\Sigma^n, d_n, \mu_n)$ forms a pq-space. It would not add much to generality to
replace $\mu_n$ by a product of copies of a different probability measure on $\Sigma$. One
immediately observes that $\{(\Sigma^n, d_n, \mu_n)\}_{n=1}^{\infty}$ is also a normal Lévy family.

Take two strings $\sigma$ and $\tau$ and let us consider the asymmetry $\Gamma_n(\sigma, \tau)$. It is easy
to see that $\Gamma_n$ takes values between 0 and 1, being equal to the quantity

$\frac{1}{n} \Big| \, |\{i : \sigma_i = 0 \wedge \tau_i = 1\}| - |\{i : \sigma_i = 1 \wedge \tau_i = 0\}| \, \Big|$.

Since our asymmetric Hamming cube is a product space, we can consider for
each $i \le n$ the value $\delta_i = d_1(\sigma_i, \tau_i) - d_1(\tau_i, \sigma_i)$ as a random variable taking values
of 0, -1 and 1 with $P(\delta_i = 0) = \frac{1}{2}$ and $P(\delta_i = -1) = P(\delta_i = 1) = \frac{1}{4}$ so that
$\Gamma_n(\sigma, \tau) = \frac{1}{n} \big| \sum_{i \le n} \delta_i \big|$. Now,

$(\mu_n \otimes \mu_n)(\{(\sigma, \tau) \in \Sigma^n \times \Sigma^n : \Gamma_n(\sigma, \tau) \ge \varepsilon\}) = P\Big(\frac{1}{n} \Big| \sum_{i \le n} \delta_i \Big| \ge \varepsilon\Big) \le 2 \exp\Big(-\frac{\varepsilon^2 n}{2}\Big)$.

This is obviously the same bound as would be obtained by application of
Theorem 4.6.2 and Proposition 4.7.4.
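
For small $n$ the distribution of the asymmetry on the asymmetric Hamming cube can be computed exactly by enumerating all pairs of strings. The sketch below does this and prints, for comparison, a Hoeffding-type bound of the form $2\exp(-\varepsilon^2 n/2)$; that form is an assumption made for the illustration, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
import math

def d1(s, t):
    # One-coordinate quasi-metric: only the step 0 -> 1 costs anything.
    return 1 if (s, t) == (0, 1) else 0

def asymmetry(sigma, tau):
    # |d_n(sigma, tau) - d_n(tau, sigma)| for the normalised l1-type sum of d1.
    n = len(sigma)
    forward = sum(d1(s, t) for s, t in zip(sigma, tau))
    backward = sum(d1(t, s) for s, t in zip(sigma, tau))
    return Fraction(abs(forward - backward), n)

def tail_measure(n, eps):
    """(mu_n x mu_n)-measure of {asymmetry >= eps}, by exhaustive enumeration."""
    cube = list(product((0, 1), repeat=n))
    hits = sum(1 for s in cube for t in cube if asymmetry(s, t) >= eps)
    return Fraction(hits, len(cube) ** 2)

n, eps = 6, Fraction(1, 2)
exact = tail_measure(n, eps)
bound = 2 * math.exp(-float(eps) ** 2 * n / 2)  # assumed form of the bound
print(exact, float(exact), bound)
```

The enumeration visits $4^n$ pairs, so it is only a sanity check for short strings; it does, however, make visible how loose an exponential bound can be at small $n$.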



4.7.2 General setting 

Product spaces assume great importance in the present investigation for two
reasons. Firstly, the theory of concentration there is quite extensively developed,
mostly due to the work of Michel Talagrand [183, 184]. Many of his results are
quite general, that is, not restricted to the products of metric spaces, and can be
applied directly to the quasi-metric spaces. Secondly, the space of protein
fragments, the main biological example of this thesis, can be modelled as a product
space, although the measure on it is definitely not a product measure. However,
the bounds on the concentration function thus obtained can be used as a worst case
estimate which can be useful in indexing applications.

It should also be noted that the generality of the results means that they can
even be applied to the similarity scores that do not transform into quasi-metrics
(i.e. which do not satisfy the triangle inequality).

Talagrand [183] obtained the exponential bounds for product spaces endowed
with a non-negative 'penalty' function generalising the distance between two points.
Penalties form a much wider class of distances than quasi-metrics but provide
ready bounds for the concentration functions.

We will outline here just one of the results from [183] and apply it to obtain
bounds for concentration functions in product quasi-metric spaces with product
measure.

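Consider a probability space $(\Omega, \Sigma, \mu)$ and the product $(\Omega^N, \mu^N)$ where the
product probability $\mu^N$ will be denoted by $P$. Consider a function
$f : 2^{\Omega^N} \times \Omega^N \to \mathbb{R}_+$ which will measure the distance between a set and a
point in $\Omega^N$. More specifically, given a function $h : \Omega \times \Omega \to \mathbb{R}_+$ such that
$h(\omega, \omega) = 0$ for all $\omega \in \Omega$, set

$f(A, x) = \inf \Big\{ \sum_{i \le N} h(x_i, y_i) : y \in A \Big\}$.

Theorem 4.7.6 ([183]). Assume that

$\|h\|_\infty = \sup_{x, y \in \Omega} h(x, y)$

is finite and set

$\|h\|_2 = \Big( \iint_{\Omega^2} h^2(\omega, \omega') \, d\mu(\omega) \, d\mu(\omega') \Big)^{1/2}$.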

Then 

□ 
If we take as $h$ above $d_\Omega$, a quasi-metric on $\Omega$, and endow $\Omega^N$ with the
$\ell_1$-type quasi-metric $d$ so that for $x, y \in \Omega^N$,
$d(x, y) = \sum_{i \le N} d_\Omega(x_i, y_i)$, we have $\|d_\Omega\|_\infty = \mathrm{diam}(\Omega)$ and
$f(A, x) = d(x, A)$. Hence, the following corollary is obtained.

Corollary 4.7.7. Suppose $\mathrm{diam}(\Omega) < \infty$. Then



Note that the bound applies to $\alpha$ and hence to both $\alpha^L$ and $\alpha^R$ because the
norms referred to above are symmetric.

An advantage of an inequality of this sort in applications to the biological
sequences is that $\|d_\Omega\|_2$ can be easily calculated for a finite alphabet $\Omega$. On the
other hand, it is remarked in [183] that the constants above are not sharp.

Example 4.7.8. Consider the pq-space $X = (\Sigma^N, d, \mu^N)$ where $\Sigma$ is the amino
acid alphabet, $d$ is the $\ell_1$-type quasi-metric extended from the quasi-metric $d_\Sigma$ on
$\Sigma$ and $\mu$ is a probability measure on amino acids. Then, Corollary 4.7.7 provides
explicit bounds for the concentration functions on $X$.

In particular, if $d_\Sigma$ is the quasi-metric obtained from the BLOSUM62
similarity scores and $\mu$ is obtained from the amino acid counts from a large protein
dataset (they differ very little if the dataset is general enough; specifically take the
counts from the NCBI nr dataset described in detail in Subsection 6.1.1), we have

$\mathrm{diam}(\Sigma) = 15$ and $\|d_\Sigma\|_2^2 = \sum_{\sigma \in \Sigma} \sum_{\tau \in \Sigma} d_\Sigma(\sigma, \tau)^2 \mu(\{\sigma\}) \mu(\{\tau\}) = 45.0193$.

While the above would give an explicit formula for the bounds of the
concentration functions on the space of peptide fragments under the assumption that
the measure on $\Sigma^N$ is a product measure, one would ultimately wish to estimate
the 'true' concentration functions on $\Sigma^N$ - this is something we do not yet know
how to do. Indeed, were it to be attempted directly from the definition, by
choosing a subset and computing the measure of its $\varepsilon$-neighbourhood one at a time, the
computational complexity would be exponential in the size of the set.
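
The two constants used in the example, the diameter and the squared 2-norm of the alphabet quasi-metric, are cheap to compute for any finite alphabet. The sketch below shows the calculation on a hypothetical three-letter alphabet with made-up scores and frequencies; the conversion $d(x, y) = s(x, x) - s(x, y)$ from a similarity score to a quasi-metric is assumed here for illustration, and none of the numbers are BLOSUM62 values.

```python
from itertools import product

# Hypothetical 3-letter alphabet with made-up similarity scores and
# frequencies; a real application would use BLOSUM62 and observed counts.
alphabet = "ACD"
score = {
    ("A", "A"): 4, ("A", "C"): 0, ("A", "D"): -2,
    ("C", "A"): 0, ("C", "C"): 9, ("C", "D"): -3,
    ("D", "A"): -2, ("D", "C"): -3, ("D", "D"): 6,
}
mu = {"A": 0.5, "C": 0.2, "D": 0.3}

def quasi_metric(x, y):
    # Assumed conversion from similarity to quasi-metric: d(x, y) = s(x, x) - s(x, y).
    return score[(x, x)] - score[(x, y)]

def diameter():
    # Largest value of the quasi-metric over all ordered pairs of letters.
    return max(quasi_metric(x, y) for x, y in product(alphabet, repeat=2))

def l2_norm_squared():
    # ||d||_2^2 = sum over ordered pairs of d(x, y)^2 * mu(x) * mu(y)
    return sum(quasi_metric(x, y) ** 2 * mu[x] * mu[y]
               for x, y in product(alphabet, repeat=2))

print(diameter(), l2_norm_squared())
```

Both quantities cost only a double loop over the alphabet, in contrast with the exponential cost of evaluating a concentration function directly from its definition.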




□ 






Chapter 5 

Indexing Schemes for Similarity 
Search 

5.1 Introduction 

It would be no exaggeration to state that database search is one of the pillars
of the modern information society. Datasets come in many forms, from simple
flat-files to relational databases. Classical databases are structured around data
points (records) with keys which may contain numeric, textual or categorical data,
allowing comparison and search queries. The most fundamental type of search
query is exact match - all datapoints matching a given key are retrieved. If the
type of the key is numeric, it is possible to perform range queries where the set of
points within a given range of the query key is retrieved. If the key is a string, a
partial match query can be asked: it retrieves those datapoints whose keys match
the query key in part (for example, by sharing a common prefix). In all cases an
additional structure, such as a linear order, is imposed on data keys to
facilitate retrieval of queries.

Sometimes it is possible to assume that datapoints belong to an n-dimensional
vector space with the coordinates corresponding to their features. In this case,
exact matches are often not sufficient: unless the underlying space is strictly
limited in some way, the probability that there will be a datapoint exactly matching
a query is close to 0. On the other hand, before proceeding with range queries, it
is necessary to define a similarity or proximity measure used to retrieve queries, a
function of two variables that on input of the query and some other point returns
their similarity (degree to which the points are similar) or distance (in this case
it is commonly called a dissimilarity measure). For n-dimensional vector spaces
the obvious choice of a dissimilarity measure is an $\ell_p$ or Minkowski metric where
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$ or its weighted modifications where each
coordinate is assigned a weight.
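
As a concrete illustration of the Minkowski family, the following minimal sketch computes a weighted $\ell_p$ distance between two vectors. The function name and the placement of the weights inside the sum are assumptions made for the example; other weighting conventions exist.

```python
def minkowski(x, y, p=2.0, weights=None):
    """Weighted l_p distance; weights default to 1 for every coordinate."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    if weights is None:
        weights = [1.0] * len(x)
    # Sum the weighted p-th powers of coordinate differences, then take the p-th root.
    total = sum(w * abs(a - b) ** p for w, a, b in zip(weights, x, y))
    return total ** (1.0 / p)

print(minkowski((0, 0), (3, 4)))          # Euclidean case, p = 2
print(minkowski((0, 0), (3, 4), p=1))     # Manhattan case, p = 1
```

Setting `p=2` recovers the Euclidean distance and `p=1` the Manhattan distance.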

The approach of retrieving points according to a similarity measure can be
applied to datasets which cannot be easily represented as vector spaces, for example
sets of words from a finite alphabet, colour images, time series, audio and video
streams etc. Such sets are often large, complex (both in the structure of data and
the underlying similarity measure) and fast growing. One well known example
is GenBank [15], the database of all publicly available DNA sequences (Figure
5.1). In this case, the size of queries is much smaller than the database size and it is
imperative to attempt to avoid scanning the whole dataset in order to retrieve a
very small part of it.

Loosely speaking, indexing denotes the introduction of a structure, called an indexing
scheme, to a dataset. This structure supports an access method for fast retrieval of
queries by enabling elimination of those parts of the dataset which can be certified
not to contain any points of the query. There are numerous examples of indexing
schemes and access methods, the best known being the B-Tree [12] from
classical database theory. However, in order to design new and efficient indexing
schemes, a fully developed mathematical paradigm of indexability that would
incorporate the existing structures and possess a predictive power is needed.

The master concept was introduced in the influential paper by Hellerstein,
Koutsoupias and Papadimitriou [87]: a workload, $W$, is a triple consisting of
a search domain $\Omega$, a dataset $X$, and a set of queries, $\mathcal{Q}$. An indexing scheme
according to [87] is just a collection of blocks covering $X$. While this concept
is fully adequate for many aspects of theory, we believe that analysis of indexing
schemes for similarity search, which is the aim of this chapter, with its strong
geometric flavour, requires a more structured approach. Hence, a concept of an
indexing scheme as a system of blocks equipped with a tree-like search structure
and decision functions at each step is put forward. This concept is a result of
analysis of numerous concrete existing approaches to indexing. The notion of a
consistent indexing scheme, guaranteeing full retrieval of all queries, is stressed.

Figure 5.1: Growth of GenBank DNA sequence database (log scale). Data taken from
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

The notion of a reduction of one workload to another, allowing creation of
new access methods from the existing ones, is also suggested. The final sections
of the present chapter discuss how geometry of high dimensions (asymptotic
geometric analysis) may offer a constructive insight into the performance of indexing
schemes and, in particular, into the nature of the curse of dimensionality.

Apart from [87], this work was influenced by the excellent reviews of
similarity search in metric spaces by Chavez, Navarro, Baeza-Yates and Marroquin
[36] and by Hjaltason and Samet [93]. While [93] is mostly concerned with
detailed descriptions of each of the existing methods, the main focus of [36]
is on classification of indexing schemes and analysis of their performance,
with particular emphasis on the curse of dimensionality. Another good survey
(in Italian) is Licia Capra's Masters thesis [33]. The conceptual framework and
techniques for explaining the curse of dimensionality come from the works of
Pestov [154, 152] and this chapter can be thought of as an extension of the results
presented therein. The paper of Ciaccia and Patella [39], while focusing only on
one particular scheme, gives an important insight into cost models for similarity
search.

It should be noted that while the fundamental building blocks - similarity
measures, data distributions, hierarchical tree index structures, and so forth - are in
plain view, the only way they can be assembled together is by examining concrete
datasets of importance and taking one step at a time. Generally, this thesis shares
the philosophy espoused by Papadimitriou in [150] that theoretical developments
and massive amounts of computational work must proceed in parallel. Indeed, it is
our general impression that indexing schemes which are able to take into account
the underlying structure of a domain often perform better than 'generic' schemes.

As noted earlier, the main motivation comes from sequence-based biology,
where similarity search already occupies a very prominent place and where
high-speed access methods for biological sequence databases will be vital both for
developing large-scale data mining projects [73] and for testing the nascent
mathematical conceptual models [34].

As seen in Chapter 3, the similarity measures used for biological sequence
comparison often correspond to partial metrics or quasi-metrics. For that reason,
a particular emphasis is placed on indexing schemes for quasi-metric workloads,
which, while frequently mentioned as generalisations of metric workloads (e.g. in
[39]), have so far been neglected as far as practical indexing schemes are
concerned. The main technical result of this chapter, Theorem 5.7.1 about
the performance of range searches, is stated and proved in terms of
quasi-metric workloads.

An indexing scheme for short peptide fragments called FSIndex illustrates 
many of the concepts introduced in the present chapter, and is the main subject of 
the next chapter. 
5.2 Basic Concepts

5.2.1 Workloads 

Definition 5.2.1 ([87, 151, 153]). A workload is a triple $W = (\Omega, X, \mathcal{Q})$, where $\Omega$
is a set called the domain, $X$ is a finite subset of the domain (dataset, or instance),
and $\mathcal{Q} \subseteq 2^\Omega$ is the set of queries, that is, some specified subsets of $\Omega$.

(Here, as in Definition 2.2.13, $2^\Omega$ denotes the set of all subsets of $\Omega$
including $\emptyset$, the empty set.)

Answering a query $Q \in \mathcal{Q}$ means listing all data points $x \in X \cap Q$. ▲

The concept of workload was introduced in [87] and the original definition is
slightly extended here by having the queries as subsets of $\Omega$ rather than $X$. This is
however an important distinction because it is often not directly known what the
dataset contains and we may want to ask 'questions' (queries) independently of
possible 'answers' (dataset points). For that reason empty queries are also allowed
- some processing is usually required in order to decide whether a query is in fact
empty. There are also technical reasons which are discussed in Subsection 5.7.2.

The domain can be a very large, even infinite set. It would be tempting at
this stage to turn the domain with the set of queries into a topological space by
requiring $\mathcal{Q}$ to satisfy the axioms of topology but there is no practical use for that.
In the later sections, when we define similarity queries, the queries will become
neighbourhoods of points according to some similarity measure (say a metric)
and would thus form a base of a topology over $\Omega$. Even in that case, there is no
need to require that finite intersections or infinite unions of families of queries are
queries themselves. Indeed, since the dataset $X$ is finite, the finite unions would be
sufficient for any practical purpose. The dataset itself with the topology induced
from the domain would be topologically discrete and zero dimensional and thus
trivial from the topological point of view.

Examples of workloads abound in database theory - we here focus on the most 
abstract versions that will be important further on. 
Example 5.2.2. The trivial workload: $\Omega = X = \{*\}$ is a one-element set, with a
sole possible non-empty query, $Q = \{*\}$.

Example 5.2.3. Let $X \subseteq \Omega$ be a dataset. The exact match queries for $X$ are
singletons, that is, sets $Q = \{\omega\}$, $\omega \in \Omega$.

Example 5.2.4. Let $n \in \mathbb{N}$, $\Omega = K \times Y_1 \times Y_2 \times \ldots \times Y_n$ and $X \subseteq \Omega$ be a dataset.
Define the set of queries by $\mathcal{Q} = \{Q_k \mid k \in K\}$ where $Q_k = \{\omega \in \Omega : \omega|_K = k\}$.
This is the most common type of a query in classical database theory where $\Omega$ is a
table with a key $K$ and a query $Q_k$ retrieves all elements of $X$ whose key is equal
to $k$.

Here is the first way to create new workloads: by combining them as disjoint
sums.

Example 5.2.5. Let $W_i = (\Omega_i, X_i, \mathcal{Q}_i)$, $i = 1, 2, \ldots, n$ be a finite collection of
workloads. Their disjoint sum is a workload $W = \bigsqcup_{i=1}^{n} W_i$, whose domain is
the disjoint union $\Omega = \Omega_1 \sqcup \Omega_2 \sqcup \ldots \sqcup \Omega_n$, the dataset is the disjoint union
$X = X_1 \sqcup X_2 \sqcup \ldots \sqcup X_n$, and the queries are of the form $Q_1 \sqcup Q_2 \sqcup \ldots \sqcup Q_n$,
where $Q_i \in \mathcal{Q}_i$, $i = 1, 2, \ldots, n$.

Example 5.2.6. Let $W = (\Omega, X, \mathcal{Q})$ be a workload, and let $\Theta \subseteq \Omega$. The
restriction of $W$ to $\Theta$ is a workload $W|_\Theta$ with domain $\Theta$, dataset $X|_\Theta = X \cap \Theta$ and the
set $\mathcal{Q}|_\Theta$ of queries of the form $Q \cap \Theta$, $Q \in \mathcal{Q}$.
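
The abstract notions of workload, exact match query and restriction translate directly into a few lines of code. In the sketch below, which is an illustration rather than a prescribed design, the domain is left implicit and a query is represented by a membership test on the domain; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable

# A query, represented by its membership test on the domain.
Query = Callable[[Hashable], bool]

@dataclass(frozen=True)
class Workload:
    dataset: FrozenSet[Hashable]  # X, a finite subset of the domain

    def answer(self, query: Query) -> set:
        # Answering Q means listing X ∩ Q; here by a plain linear scan.
        return {x for x in self.dataset if query(x)}

    def restrict(self, theta: Query) -> "Workload":
        # Restriction to a subdomain: keep only the data points inside theta.
        return Workload(frozenset(x for x in self.dataset if theta(x)))

def exact_match(omega) -> Query:
    # The singleton query {omega}.
    return lambda x: x == omega

w = Workload(frozenset({"cat", "cow", "dog"}))
print(w.answer(exact_match("dog")))   # a non-empty answer
print(w.answer(exact_match("emu")))   # an empty query is still a valid question
print(sorted(w.restrict(lambda x: x.startswith("c")).dataset))
```

Representing queries as predicates keeps them independent of the dataset, which mirrors the choice of defining queries as subsets of the domain rather than of $X$.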

The main objects of this chapter are similarity workloads where the queries 
are generated by similarity (or proximity) measures. 

5.2.2 Similarity queries 

In general, a similarity measure [41, 40, 93] on a set $\Omega$ is a function of two
variables $s : \Omega \times \Omega \to \mathbb{R}$, often subject to additional restrictions. In a strict sense,
such as in bioinformatics, the term similarity measure (or similarity score, or
just similarity) is used for a function $s$ such that the pairs of 'close' points take
a large and often positive value while the points which are 'far' from each other
take a small (often negative) value.

Throughout this work we shall always consider dissimilarity [41, 40] or
distance measures, the similarity measures (in a wider sense) which measure how far
apart two points are. We require that all the values are non-negative and add an
additional requirement that the pair of identical points takes the value 0 (this is
different from Remark 2.1.2 where we assume in addition that a distance satisfies the
triangle inequality). The justification is that most commonly used (dis)similarity
measures are metrics or at least quasi-metrics and that it is almost always possible
to convert a similarity measure in a strict sense into a dissimilarity measure.

Definition 5.2.7. A dissimilarity measure on a set $\Omega$ is a function $d : \Omega \times \Omega \to \mathbb{R}_+$
where for all $\omega \in \Omega$, $d(\omega, \omega) = 0$. ▲

The three types of queries based on a dissimilarity measure of most interest
[36] are: a range query, a nearest neighbour query and a k-nearest neighbours (or
kNN) query.

Definition 5.2.8. Let $\Omega$ be a set, $d$ a dissimilarity measure on $\Omega$, $X \subseteq \Omega$ a dataset
and $r \in \mathbb{R}_+$. The ($r$-)range similarity query centred at $\omega \in \Omega$, denoted $Q^{rng}_d(\omega, r)$,
is defined by

$Q^{rng}_d(\omega, r) = \{x \in \Omega : d(\omega, x) \le r\}$,

that is, $Q^{rng}_d(\omega, r)$ consists of all $x \in \Omega$ that are within the distance $r$ of $\omega$. We will
denote by $\mathcal{Q}^{rng}_d$ the set $\{Q^{rng}_d(\omega, r) \mid \omega \in \Omega, r \in \mathbb{R}_+\}$ of all possible range queries.
We call a workload $(\Omega, X, \mathcal{Q}^{rng}_d)$ a range (dis)similarity workload. ▲

If $d$ is a quasi-metric, the range query $Q^{rng}_d(\omega, r)$ corresponds exactly to the
left closed ball $\overline{\mathfrak{B}}^L_r(\omega)$ and if $d$ is a metric then $Q^{rng}_d(\omega, r) = \overline{\mathfrak{B}}_r(\omega)$, the closed
ball of radius $r$ about $\omega$.
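
To make the role of asymmetry concrete, the sketch below answers a range query by linear scan, once with a toy quasi-metric and once with an ordinary metric. The quasi-metric `uphill` is a hypothetical example chosen for illustration and does not come from the text.

```python
def range_query(d, dataset, omega, r):
    """Return X ∩ Q(omega, r) = {x in X : d(omega, x) <= r} by linear scan."""
    return [x for x in dataset if d(omega, x) <= r]

def uphill(x, y):
    # Toy quasi-metric on numbers (hypothetical): moving up costs the
    # difference, moving down is free.
    return max(y - x, 0)

def absolute(x, y):
    # The usual metric on numbers.
    return abs(x - y)

data = [1, 4, 6, 9]
print(range_query(uphill, data, 5, 2))    # left ball of the quasi-metric
print(range_query(absolute, data, 5, 2))  # ordinary closed ball
```

Because the distance is always measured from the query centre to the data point, the quasi-metric ball here contains every point below the centre, while the metric ball is symmetric about it.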

Definition 5.2.9. Let $\Omega$ be a set, $d$ a dissimilarity measure on $\Omega$ and $X \subseteq \Omega$ a
dataset. The nearest neighbour query centred at $\omega \in \Omega$, denoted $Q^{NN}_d(\omega, X)$, is
defined by

$Q^{NN}_d(\omega, X) = \{x \in X : d(\omega, x) \le d(\omega, y) \text{ for all } y \in X\}$,

that is, it consists of members of $X$ closest to $\omega$.

Denote by $d^{NN}_X(\omega)$ the distance to a nearest neighbour of $\omega$ in $X$.

We call a workload $(\Omega, X, \mathcal{Q}^{NN}_d)$ a nearest neighbour (dis)similarity workload.

▲

Definition 5.2.10. Let $\Omega$ be a set, $d$ a dissimilarity measure on $\Omega$ and $X \subseteq \Omega$ a
dataset and let

$r_k = \inf\{r > 0 : |Q^{rng}_d(\omega, r) \cap X| \ge k\}$.

The k-nearest neighbour query centred at $\omega \in \Omega$, also called a kNN query,
denoted $Q^{kNN}_d(\omega, X)$, is defined by

$Q^{kNN}_d(\omega, X) = Q^{rng}_d(\omega, r_k) \cap X$.

In other words, $Q^{kNN}_d(\omega, X)$ is a set of $k$ elements of $X$ closest to $\omega$ plus any other
elements of $X$ at the same distance as the $k$-th nearest neighbour.

We call a workload $(\Omega, X, \mathcal{Q}^{kNN}_d)$ a kNN (dis)similarity workload. ▲

The nearest neighbour and the k-nearest neighbours queries are jointly called
NN-queries [36]. Unlike range queries, they directly depend on the dataset $X$.
Note that our definition of kNN queries differs from the one commonly used in
the literature [36, 93], where any set of $k$ elements of $X$ closest to $\omega$ is sufficient
to satisfy a kNN query. We chose the above definition for consistency - every
algorithm is guaranteed to return the same result and $Q^{kNN}_d(\omega, X)$ denotes a single
set and not a family of sets.

Our definition also makes the connection between NN-queries and range queries
explicit: any NN-query can be expressed in terms of a range query. For example,
for a nearest neighbour query, we have $Q^{NN}_d(\omega, X) = X \cap Q^{rng}_d(\omega, d^{NN}_X(\omega))$. Of
course, in practical situations, $d^{NN}_X(\omega)$ is not known in advance. Nevertheless, we
shall mostly concentrate on range similarity queries and workloads as the most
fundamental of the three and easiest to process.
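
The tie-inclusive definition of a kNN query can be sketched as a range query whose radius is the $k$-th smallest distance from the query centre. The code below is a naive linear-scan illustration with hypothetical function names; it is not an efficient access method.

```python
def knn_query(d, dataset, omega, k):
    """k nearest neighbours of omega in X, including every tie at the k-th distance."""
    if k < 1 or k > len(dataset):
        raise ValueError("k must be between 1 and the dataset size")
    distances = sorted(d(omega, x) for x in dataset)
    r_k = distances[k - 1]  # smallest radius whose ball holds at least k points
    return [x for x in dataset if d(omega, x) <= r_k]

def nn_query(d, dataset, omega):
    # A nearest neighbour query is the case k = 1.
    return knn_query(d, dataset, omega, 1)

data = [1, 4, 6, 9]
dist = lambda x, y: abs(x - y)
print(nn_query(dist, data, 5))       # both 4 and 6 are at distance 1
print(knn_query(dist, data, 5, 3))   # the tie at distance 4 pulls in 1 and 9
```

Returning all ties means the answer may contain more than $k$ points, but it is the same for every algorithm, which is the consistency property the definition is designed to give.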

Definition 5.2.11. Let $\Omega$ be a domain and $d_1$ and $d_2$ dissimilarity measures. If
$\mathcal{Q}^{rng}_{d_1} = \mathcal{Q}^{rng}_{d_2}$ we call $d_1$ and $d_2$ equivalent. ▲
Example 5.2.12. Let $(\Omega, d_1)$ and $(\Omega, d_2)$ be metric spaces. Recall that two metrics
$d_1$ and $d_2$ are equivalent if and only if there exist strictly positive constants $a, b$
such that for all $x, y \in \Omega$, $a d_1(x, y) \le d_2(x, y) \le b d_1(x, y)$. The metric and
dissimilarity measure notions of equivalency do not follow from each other.

Take a set $\Omega = \{\frac{1}{n} : n \in \mathbb{N}_+\} \cup \{0\}$ with the metrics $d_1$ and $d_2$ where
$d_1(x, y) = |x - y|$ and $d_2(x, y) = \sqrt{|x - y|}$. It is clear that $d_1$ and $d_2$ are
equivalent as dissimilarity measures since they generate the same sets of balls while
there is no strictly positive constant $a$ such that for all $x \in \Omega$, $\sqrt{x} \le a x$ and thus
$d_1$ and $d_2$ are not equivalent as metrics.

On the other hand, let $\Omega = \mathbb{R}^2$ where $d_1(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$
and $d_2(x, y) = \sqrt{(x_1 - y_1)^2 + 2(x_2 - y_2)^2}$. It is easy to see that $d_1$ and $d_2$ are
equivalent metrics but not equivalent dissimilarity measures since $d_1$ generates
the balls of circular shape (Euclidean balls) while $d_2$ generates elliptical balls.

If $d_2$ is obtained from $d_1$ by a metric transform (i.e. $d_2(x, y) = F(d_1(x, y))$
where $F : [0, +\infty) \to [0, +\infty)$ is a concave monotone function with $F(0) = 0$),
then $d_1$ and $d_2$ are equivalent as dissimilarity measures. One example of a metric
transform is $d_2 = a d_1$ for some $a > 0$, where $d_2$ is a multiple of $d_1$.
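
On a finite set, equivalence of two dissimilarity measures can be checked mechanically by listing every distinct ball each of them generates. The sketch below does this for $d_1(x, y) = |x - y|$ and its square-root transform; restricting attention to a small finite set of points, and the function name `all_balls`, are assumptions made for the illustration.

```python
import math

def all_balls(d, points):
    """The set of distinct range queries {x : d(w, x) <= r} on a finite set."""
    balls = set()
    for w in points:
        # Only radii equal to some distance from w can change the ball.
        for r in {d(w, x) for x in points}:
            balls.add(frozenset(x for x in points if d(w, x) <= r))
    return balls

points = [0.0, 0.25, 0.5, 1.0]
d_abs = lambda x, y: abs(x - y)
d_sqrt = lambda x, y: math.sqrt(abs(x - y))  # metric transform F(t) = sqrt(t)

print(all_balls(d_abs, points) == all_balls(d_sqrt, points))
```

The two families of balls coincide because the transform is strictly increasing, so it changes the radii but not the order of the distances from any centre.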

5.2.3 Indexing schemes 

Definition 5.2.13. An access method for a workload $W$ is an algorithm that on an
input $Q \in \mathcal{Q}$ outputs all elements of $Q \cap X$. ▲

Typical access methods come from indexing schemes.

Definition 5.2.14. Let $T$ be a rooted finite tree. Denote by $L(T)$ the set of leaf
nodes and by $I(T)$ the set of inner nodes of $T$. The notation $t \in T$ means that $t$ is
a node of $T$, and $C_t$ denotes the set of all children of a $t \in I(T)$. For any non-root
node $t$, the parent of $t$ is denoted $p(t)$. ▲

Definition 5.2.15. Let $W = (\Omega, X, \mathcal{Q})$ be a workload. An indexing scheme on $W$
is a triple $\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$, where

• $T$ is a rooted finite tree, with root node $*$,

• $\mathcal{B}$ is a collection of subsets $B_t \subseteq \Omega$ (blocks, or bins), where $t \in L(T)$,
such that $X \subseteq \bigcup_{t \in L(T)} B_t$,

• $\mathcal{F} = \{F_t : t \in I(T)\}$ is a collection of set-valued decision functions,
$F_t : \mathcal{Q} \to 2^{C_t}$, where each value $F_t(Q) \subseteq C_t$ is a subset of children of the node $t$.

▲
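Algorithm 5.2.1: RetrieveIndexedQuery($\mathcal{I}$, $Q$)

comment: Indexing scheme $\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$ over $W = (\Omega, X, \mathcal{Q})$
comment: Query $Q \in \mathcal{Q}$

$R \leftarrow \emptyset$
$A_0 \leftarrow \{*\}$
$i \leftarrow 0$
while $A_i \ne \emptyset$
    do  $A_{i+1} \leftarrow \emptyset$
        for each $t \in A_i$
            do  if $t \notin L(T)$
                    then $A_{i+1} \leftarrow A_{i+1} \cup F_t(Q)$
                    else for each $x \in B_t$
                        do  if $x \in Q$
                                then $R \leftarrow R \cup \{x\}$
        $i \leftarrow i + 1$
return ($R$)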



Hence, an indexing scheme consists of a cover $\mathcal{B}$ of $X$ by blocks and a tree
structure that determines the way in which a query is processed: for each query
we traverse those nodes that have been selected at their parent nodes using the
decision functions (Figure 5.2). Each of the bins associated with selected leaf
nodes is sequentially scanned for elements of the dataset satisfying the query.
Algorithm 5.2.1 depicts a breadth-first traversal of the tree but any other equivalent
algorithm can be used. We will only consider consistent indexing schemes: those
for which the above procedure retrieves all dataset elements belonging to any
query, that is, no query points are missed. This is more formally expressed by the
following definition:



Definition 5.2.16. An indexing scheme $\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$ for a workload
$W = (\Omega, X, \mathcal{Q})$ is consistent if for every $Q \in \mathcal{Q}$ and for every $x \in Q \cap X$ there
exists $t \in L(T)$ such that $x \in B_t$ and the path $s_0 s_1 \ldots s_m$, where $s_0 = *$, $s_m = t$
and $s_i = p(s_{i+1})$, satisfies $s_{i+1} \in F_{s_i}(Q)$ for all $i = 0, 1, \ldots, m - 1$. ▲

Clearly, for a consistent indexing scheme, any algorithm which, for any query, 
starting from the root, visits all branches returned by the decision functions at each 
node and scans all bins associated with the leaf nodes visited for the members of 
the query, is an access method. Algorithm 5.2.1 provides one example.

Our definition of indexing scheme extends the definition of [87] which considers
only the set of blocks. The computational complexity of the decision functions
$F_t(Q)$, as well as the amount of 'branching' resulting from an application of
Algorithm 5.2.1, become major efficiency factors in case of similarity-based search,
which is why we feel they should be brought into the picture.

Note that blocks may overlap in an indexing scheme, that is, a point $x \in X$
can belong to several blocks. There may even be different leaves pointing to the
same block. This observation is at the heart of the concept of storage redundancy
developed in [87] and [86] which will be examined later.

We now present examples of indexing schemes related to some of the most
fundamental algorithms of computer science, reformulating them within our
proposed framework. We provide a very short description and a reference to the
appropriate section of Volume 3 (Sorting and Searching) of Knuth's 'The Art
of Computer Programming' (TAOCP) [111]. It should be noted that while the
discussion in TAOCP applies to exact searches, the ideas in many cases apply to
more general cases with very few modifications.

Example 5.2.17. A simple linear scan (TAOCP, Vol. 3, Section 6.1) of a dataset
$X$ corresponds to the indexing scheme where the tree $T = \{*, \star\}$ has a root $*$ and
a single child $\star$, $\mathcal{B}$ consists of a single block $B_\star = \Omega$, and the decision function
always outputs the same value $\{\star\}$.

Example 5.2.18. Hashing (TAOCP, Vol. 3, Section 6.4) can be described in terms
of the following indexing scheme for exact searches. The tree $T$ has depth one,
with its leaves corresponding to bins, and the decision function is a hashing
function: on input of a query object $Q$ it outputs the bin in which the elements of
$X$ matching $Q$ are stored. If there are collisions (i.e. different objects mapping to
the same bin), the retrieved bin needs to be further processed.

A related technique, which can be used in some cases, is to store the results of 
commonly used queries and retrieve them at search time using a hash function. 

Example 5.2.19. If the domain is linearly ordered and the set of queries consists
of intervals $[a, b]$ then an efficient indexing structure is constructed using a
generalisation of binary search trees (TAOCP, Vol. 3, Section 6.2). Each bin contains
one element of the dataset and every node $t \in T$ is associated with an interval
$[t_1, t_2]$ which, in the case of an inner node, covers the intervals associated with the
children of $t$ and in the case of a leaf node corresponds to the element of the dataset
contained in the bin $B_t$ (Figure 5.3). Each decision function $F_t$ on an input $[a, b]$
outputs the set of all children nodes $s$ of $t$ such that $[s_1, s_2] \cap [a, b] \neq \emptyset$.
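
A minimal Python sketch of the interval tree of Example 5.2.19 is given below, assuming a non-empty sorted list of integer keys. The names `IntervalNode`, `build`, `decide` and `range_query` are illustrative only. Splitting each list at its upper middle is one possible choice; for the keys 1 to 10 it happens to produce intervals such as [1,5], [6,10], [1,3] and [4,5].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntervalNode:
    lo: int                                   # left end t_1 of the interval
    hi: int                                   # right end t_2 of the interval
    children: List["IntervalNode"] = field(default_factory=list)


def build(sorted_keys: List[int]) -> IntervalNode:
    """Build a binary tree over a non-empty sorted list; each leaf holds one key."""
    if len(sorted_keys) == 1:
        return IntervalNode(sorted_keys[0], sorted_keys[0])
    mid = (len(sorted_keys) + 1) // 2
    left, right = build(sorted_keys[:mid]), build(sorted_keys[mid:])
    return IntervalNode(left.lo, right.hi, [left, right])


def decide(node: IntervalNode, a: int, b: int) -> List[IntervalNode]:
    """Decision function: children whose interval intersects [a, b]."""
    return [s for s in node.children if s.lo <= b and a <= s.hi]


def range_query(root: IntervalNode, a: int, b: int) -> List[int]:
    found, frontier = [], [root]
    while frontier:
        node = frontier.pop()
        if node.children:
            frontier.extend(decide(node, a, b))
        elif a <= node.lo <= b:               # leaf: scan its one-element bin
            found.append(node.lo)
    return sorted(found)


tree = build(list(range(1, 11)))
answer = range_query(tree, 4, 7)
```

Only the branches whose intervals meet the query interval are visited, so the number of visited nodes grows with the tree depth and the answer size rather than with the size of the dataset.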

Generalisations of this idea form the core of indexing schemes for similarity
workloads (Sections 5.3 and 5.4).

[1,10] 



[1,5] [6,10] 
[1,3] [4,5] [6,8] [9,10] 

[1,2] [3,3] [4,4] [5,5] [6,7] [8,8] [9,9] [10,10] 
[1,1] [2,2] [6,6] [7,7] 

Figure 5.3: An indexing tree for range queries of a linearly ordered dataset of 10 ele- 
ments. 



5.2.4 Inner and outer workloads 

Definition 5.2.20. A workload $W = (\Omega, X, \mathcal{Q})$ is called inner if $X = \Omega$ and outer
otherwise. ▲

Typically, for outer workloads $|X| \ll |\Omega|$. The difference between inner and
outer workloads is particularly significant for similarity searches because inner
similarity workloads can be thought of as directed weighted graphs where the
dataset points are nodes and two nodes are connected with an edge with a weight
corresponding to their similarity. In such a case, it may be possible, depending on
the characteristics of the graph and the types of queries, to use graph traversal
algorithms as access methods.

In theory, every workload $W = (\Omega, X, \mathcal{Q})$ can be replaced with an inner
workload $(X, X, \mathcal{Q}|_X)$, where the new set of queries $\mathcal{Q}|_X$ consists of sets $Q \cap X$, $Q \in \mathcal{Q}$.
However, in practical terms this reduction often makes little sense: the
complexity of storing and processing the query sets $Q \cap X$ remains essentially
the same, the domain is still required to be implicitly present, and we
lose the geometric clarity of having the set $\Omega$ present explicitly.



5.3 Metric trees 

Most existing indexing schemes for similarity search apply to metric similar- 
ity workloads, where a dissimilarity measure on the domain is a metric and the 
queries are balls of a given radius. Some indexing schemes apply only to a re- 
stricted class of metric spaces, such as vector spaces, others apply to any metric 
space. In most cases we encounter a hierarchical tree index structure where each 
node is associated with a set covering a portion of the dataset and a certification 
function which certifies if the query ball does not intersect the covering set, in 
which case the node is not visited and the whole branch is pruned (Figure 5.4).
We show that for such an indexing scheme to be consistent, that is, that no members
of the dataset satisfying the query are missed, the certification functions need to 
be 1-Lipschitz. The following concept of a metric tree in its present precise form 
is new, and is based on our analysis of numerous existing approaches, which all 
turn out to be particular cases of our concept. 

Figure 5.4: A metric tree indexing scheme. To retrieve the shaded range query the nodes
above the dashed line must be scanned; the branches below can be pruned.

Definition 5.3.1. Let $(\Omega, X, \mathcal{Q}^{\mathrm{rng}}_d)$ be a range dissimilarity workload, where $d$ is a
metric. Let $T$ be a finite rooted tree with root $*$ and $\mathcal{B} = \{B_t \mid t \in T\}$ a collection
of subsets of $\Omega$ such that

$$X \subseteq \bigcup_{t \in L(T)} B_t \subseteq \Omega \tag{5.1}$$

and for every inner node $t$,

$$\bigcup_{s \in C_t} (B_s \cap X) \subseteq B_t. \tag{5.2}$$

Also, let $\mathcal{F} = \{f_t : \Omega \to \mathbb{R} \mid t \in T \setminus \{*\}\}$ be a collection of functions, called
certification functions, such that for each $t \in T \setminus \{*\}$,

• $f_t$ is 1-Lipschitz, and

• For all $\omega \in B_t$, $f_t(\omega) \le 0$.

We call the triple $(T, \mathcal{B}, \mathcal{F})$ a metric tree for the workload $(\Omega, X, \mathcal{Q}^{\mathrm{rng}}_d)$. Let
$\tilde{\mathcal{B}} = \{B_t \mid t \in L(T)\}$ and $\tilde{\mathcal{F}} = \{F_t : \mathcal{Q}^{\mathrm{rng}}_d \to 2^{C_t} \mid t \in I(T)\}$ where

$$F_t(\mathfrak{B}_\varepsilon(\omega)) = \{s \in C_t : f_s(\omega) \le \varepsilon\}. \tag{5.3}$$

The indexing scheme $\mathcal{I}(T, \mathcal{B}, \mathcal{F}) = (T, \tilde{\mathcal{B}}, \tilde{\mathcal{F}})$ is called a metric tree indexing
scheme. ▲

The theoretical significance of the proposed concept is stressed by the follow- 
ing result. 

Theorem 5.3.2. Let $W = (\Omega, X, \mathcal{Q}^{\mathrm{rng}}_d)$ be a metric similarity workload and $(T, \mathcal{B}, \mathcal{F})$
a metric tree. Then the metric indexing scheme $\mathcal{I}(T, \mathcal{B}, \mathcal{F})$ is a consistent indexing
scheme for $W$.

Proof. Let $Q = \mathfrak{B}_\varepsilon(\omega)$ be a range query and let $x \in Q \cap X$, that is, $d(\omega, x) \le \varepsilon$.
By (5.1), there exists a leaf node $t$ such that $x \in B_t$. Consider the path $s_0 s_1 \ldots s_m$
where $s_0 = *$, $s_m = t$ and $s_i = p(s_{i+1})$, from root to $t$. By (5.2), for each
$i = 1, 2, \ldots, m$, we have $(B_t \cap X) \subseteq (B_{s_i} \cap X) \subseteq B_{s_{i-1}}$ and hence $x \in B_{s_i}$. It
follows that $f_{s_i}(x) \le 0$ and since $f_{s_i}$ is a 1-Lipschitz function, we have

$$f_{s_i}(\omega) \le |f_{s_i}(\omega) - f_{s_i}(x)| \le d(\omega, x) \le \varepsilon.$$

Therefore, $s_i \in F_{s_{i-1}}(Q)$ and hence $\mathcal{I}(T, \mathcal{B}, \mathcal{F})$ is a consistent indexing scheme. □

Once the collection $B_t$, $t \in T$ of blocks has been chosen, the certification
functions always exist.

Theorem 5.3.3. Let $(\Omega, X, \mathcal{Q}^{\mathrm{rng}}_d)$ be a range dissimilarity workload, where $d$ is a
metric, $T$ be a finite rooted tree with root $*$ and $\mathcal{B} = \{B_t \mid t \in T\}$ a collection of
subsets of $\Omega$ satisfying (5.1) and (5.2). Then, for each $t \in T$ where $t \neq *$, there
exists a 1-Lipschitz function $f_t$ such that $f_t(\omega) \le 0$ for all $\omega \in B_t$.

Proof. Put $f_t(\omega) = d(B_t, \omega) = \inf_{x \in B_t} d(x, \omega)$. By Lemma 2.4.5, $f_t$ is
1-Lipschitz and clearly $f_t|_{B_t} = 0$. □

However, the distances from sets are typically computationally very expen- 
sive. The art of constructing a metric tree consists in choosing computationally 
inexpensive certification functions that at the same time don't result in an exces- 
sive branching. 

We now briefly review some of the most prominent examples of metric trees. We
concentrate on their overall structures in terms of the above general model and pay
less attention to the details of algorithms and implementations, even though they
significantly influence the performance. For many more examples and detailed
descriptions the reader is directed to the original references as well as the excellent
reviews [36] and [93]. The concept of a general metric tree equipped with 1-
Lipschitz certification functions was first formulated in the present exact form in 



5.3.1 Vector space indexing schemes 

We first examine indexing schemes for 'classical range searches', that is, for
vector space workloads where the domain is $\mathbb{R}^n$ and the set of queries is given by
the balls with respect to the $\ell_\infty$ metric, also called rectangles. The rationale for
this terminology is given by the shape of unit balls with respect to the $\ell_\infty$ norm
in $\mathbb{R}^2$ - the shapes of $\ell_1$, $\ell_2$ and $\ell_\infty$ balls are shown in Figure 5.5. Note also that
this is the most general setting since for any $1 \le p \le \infty$ an $\ell_p$ ball is contained
in the $\ell_\infty$ ball with the same centre and radius and hence an access method for an
$\ell_p$ workload can be obtained by what we call a projective reduction (Subsection
5.6.4 below) to the $\ell_\infty$ workload. In practice, queries can be even more general,
consisting of rectangles with sides of different lengths but this does not add
anything to generality conceptually (if not in practical terms) since such queries can
be represented, for example, as unions of (unit) balls.

Figure 5.5: The shapes of the $\ell_1$, $\ell_2$ and $\ell_\infty$ unit balls.

Example 5.3.4. The R-tree [84] is a dynamic structure for indexing points and
rectangles in vector spaces. Many variants showing performance improvements
exist, such as the R+-tree [172] and the R*-tree [12]. The main feature of all
variants is that bounding rectangles are used to enclose data points (at leaf nodes)
or bounding rectangles of children nodes.

The R-trees are paged structures - nodes are stored in secondary memory and 
retrieved as needed. Each non-root node of the tree T has between m and M 
children with all leaves containing data points or rectangles appearing at the same 
level. The minimum bounding rectangle $R_t$ is associated to each node $t \in T$
(Figure 5.6). A node $t$ is visited if the query rectangle intersects $R_t$, that is,
certification functions are $f_t : \omega \mapsto d(\omega, R_t)$, where $d$ is the $\ell_\infty$-metric. The structure
is fully dynamic - insertions and deletions can be intermixed with queries.

The main factor in performance of R-trees is organisation of bounding rectan- 
gles. The optimisations of the R*-tree, which was shown to have the best perfor- 
mance of the above mentioned three variants, are based on reduction of volume 
and lengths of the edges of bounding rectangles at each node as well as on min- 
imisation of overlap between rectangles associated with different nodes. 

Example 5.3.5. The X-tree [17] is a modification of the R-tree suitable for
indexing high-dimensional vector space workloads. It is based on the observation (see
Subsection 5.7.3) that high overlap between bounding rectangles of many
children of R-tree nodes in high dimensions, leading to sequential scan of all of them,
is unavoidable. Hence the nodes whose bounding rectangles overlap to an
excessively high degree are collapsed into supernodes which are organised for linear
scan (Figure 5.7). The X-tree uses the same certification functions as the R-tree:
the distances to bounding rectangles. The authors report that the X-tree outperforms
the R*-tree by as much as 8 times on high dimensional datasets.

Example 5.3.6. Consider the vector space workloads where the metric is the
Euclidean ($\ell_2$) distance (more generally the weighted Euclidean distance where $w$
is a vector of weights and $d(x, y) = \sqrt{\sum_i w_i (x_i - y_i)^2}$). The SS-tree [210] is
an indexing scheme where bounding spheres instead of bounding rectangles are
used at each node (Figure 5.8). More precisely, the region $B_t$ associated with each
node $t$ is a ball centred at $x_t$, the centroid of all dataset points covered by $B_t$, with
the covering radius $r_t = \max\{d(x_t, y) \mid y \in X \cap B_t\}$. Hence, the certification
functions are of the form $f_t(\omega) = d(\omega, x_t) - r_t$.




Figure 5.6: An example of R-tree in two dimensions.

Figure 5.7: Structure of X-tree.

Figure 5.8: An example of SS-tree.

5.3.2 General metric space indexing schemes

We now turn to the indexing schemes for general metric space workloads where
no structure in addition to metric is assumed, that is, all that is available at creation
time is the set of data points and a metric $d$.



Example 5.3.7. The vp-tree [217] is an indexing scheme with a binary tree and
certification functions of the form $f_{t_\pm}(\omega) = \pm(d(\omega, x_t) - M_t)$, where $x_t \in X$
is a vantage point chosen for the non-leaf node $t$, $M_t$ is the median value for the
function $\omega \mapsto d(\omega, x_t)$, and $t_\pm$ are two children of $t$. Thus, at each non-leaf node
$t$, a part of the dataset covered by $B_t$ is partitioned into two equal halves where
$B_{t_+} = B_t \cap \mathfrak{B}_{M_t}(x_t)$ and $B_{t_-} = B_t \setminus \mathfrak{B}_{M_t}(x_t)$ (Figure 5.9).

The m-ary versions, where the dataset is split in $m$ equal parts at each node,
have also been proposed.

Figure 5.9: An example of a binary vp-tree with vantage points $x_0$, $x_1$ and $x_2$. The leaf
nodes $s_1$ to $s_4$ correspond to regions $B_1$ to $B_4$.
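
The following Python sketch illustrates a binary vp-tree with the certification functions $\pm(d(\omega, x_t) - M_t)$. It is a simplified illustration and not the published algorithm: the vantage point is taken to be the first point of each part, leaves hold up to `leaf_size` points, and the names `VPNode`, `build` and `range_search` are hypothetical. Integer points with the $\ell_1$ metric are used so that all comparisons are exact.

```python
import random
import statistics
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


@dataclass
class VPNode:
    points: List[Point] = field(default_factory=list)  # bin B_t (leaves only)
    vantage: Optional[Point] = None                     # x_t
    median: float = 0.0                                 # M_t
    inside: Optional["VPNode"] = None                   # t_+ : d(y, x_t) <= M_t
    outside: Optional["VPNode"] = None                  # t_- : d(y, x_t) >  M_t


def build(points: List[Point], d=manhattan, leaf_size: int = 4) -> VPNode:
    if len(points) <= leaf_size:
        return VPNode(points=list(points))
    vantage = points[0]                                 # simplistic choice
    median = statistics.median(d(vantage, y) for y in points)
    inside = [y for y in points if d(vantage, y) <= median]
    outside = [y for y in points if d(vantage, y) > median]
    if not outside:                                     # degenerate split
        return VPNode(points=list(points))
    return VPNode(vantage=vantage, median=median,
                  inside=build(inside, d, leaf_size),
                  outside=build(outside, d, leaf_size))


def range_search(node: VPNode, omega: Point, eps: float, d=manhattan) -> List[Point]:
    if node.vantage is None:                            # leaf: scan the bin
        return [y for y in node.points if d(omega, y) <= eps]
    dist = d(omega, node.vantage)
    hits: List[Point] = []
    if dist - node.median <= eps:                       # f_{t+}(omega) <= eps
        hits += range_search(node.inside, omega, eps, d)
    if node.median - dist <= eps:                       # f_{t-}(omega) <= eps
        hits += range_search(node.outside, omega, eps, d)
    return hits


random.seed(0)
data = [(random.randint(0, 50), random.randint(0, 50)) for _ in range(200)]
tree = build(data)
```

Because both certification functions are 1-Lipschitz and non-positive on their own half, the search returns exactly the points that a linear scan would return.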



Example 5.3.8. The mvp-tree [25] is a modification of the vp-tree which uses
multiple vantage points at each node. In the binary case, for any node $t$, two
vantage points, $x_1$ and $x_2$ are chosen and the part of the dataset covered by $B_t$ is
split in four parts.

Let $t$ be an inner node and $g_1$ and $g_2$ be the functions $\Omega \to \mathbb{R}$ where $g_1(\omega) = d(\omega, x_1)$
and $g_2(\omega) = d(\omega, x_2)$. Let $M_1$ be the median value for $g_1$ and
$B_+ = B_t \cap \mathfrak{B}_{M_1}(x_1)$, $B_- = B_t \setminus \mathfrak{B}_{M_1}(x_1)$. Let $M_{2+}$ be the median value for $g_2|_{B_+}$
and $M_{2-}$ the median value for $g_2|_{B_-}$. The certification functions for the children
$s_1$, $s_2$, $s_3$, $s_4$ are

$\max\{d(\omega, x_1) - M_1, \; d(\omega, x_2) - M_{2+}\}$,
$\max\{d(\omega, x_1) - M_1, \; M_{2+} - d(\omega, x_2)\}$,
$\max\{M_1 - d(\omega, x_1), \; d(\omega, x_2) - M_{2-}\}$, and
$\max\{M_1 - d(\omega, x_1), \; M_{2-} - d(\omega, x_2)\}$.

The maxima above are computed from left to right and the second value is not
computed if the first exceeds the search radius. The main difference from the
binary vp-tree is that two instead of three vantage points are used to divide a
covering region into four regions, resulting in fewer distance computations.

Figure 5.10: An example of an mvp-tree with vantage points $x_1$ and $x_2$. The leaf nodes
$s_1$ to $s_4$ correspond to regions $B_1$ to $B_4$.

Example 5.3.9. The GNAT (Geometric Near-neighbour Access Tree) indexing
scheme proposed by Sergey Brin [27], one of the founders of Google, is based on
splitting the domain $B_t$ at each node $t$ into $m$ regions $B_{t_i}$ based on proximity to
the split points $x_{t_1}, x_{t_2}, \ldots, x_{t_m} \in X$, yielding an m-ary tree (Figure 5.11). The
sets $B_{t_i}$, called Dirichlet domains, correspond to Voronoi cells in $\mathbb{R}^n$. For each
pair of split points $x_{t_i}, x_{t_j}$, the values $r^{\min}_{ij} = \min\{d(x_{t_i}, y) \mid y \in B_{t_j} \cap X\}$ and
$r^{\max}_{ij} = \max\{d(x_{t_i}, y) \mid y \in B_{t_j} \cap X\}$ are stored. The certification functions are of
the form

$$f_{t_j}(\omega) = \max_i \max\{d(\omega, x_{t_i}) - r^{\max}_{ij}, \; r^{\min}_{ij} - d(\omega, x_{t_i})\}.$$

Figure 5.11: An example of GNAT.



Example 5.3.10. Unlike the vp-tree and the GNAT but like the R-trees, the
M-tree [41] is a dynamic and paged structure. The tree is binary and at each node
$t$ a routing object $x_t \in X$ is stored together with the covering radius
$r_t = \max_{y \in B_t \cap X} d(x_t, y)$ and the distances to the routing objects of the children. The
certification functions are of the form

$$f_s(\omega) = \max\{|d(\omega, x_{p(s)}) - d(x_{p(s)}, x_s)| - r_s, \; d(\omega, x_s) - r_s\}.$$

If the value $|d(\omega, x_{p(s)}) - d(x_{p(s)}, x_s)| - r_s$ exceeds $\varepsilon$ the rest of $f_s$ need not be
computed. This avoids potentially expensive computation of $d(\omega, x_s)$. The way
the routing points are chosen and data points divided between them is determined
by the user by choosing one of many available split policies. The best performing
policy was found to be the generalised hyperplane decomposition where each data
object is assigned to the routing object closest to it.
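
The left-to-right evaluation of the M-tree certification function can be sketched as follows. The function `must_visit` and its parameter names are hypothetical; the sketch assumes that the distance from the query centre to the parent routing object was already computed when the parent was visited, and that the parent-to-child distance is stored in the index. A call-recording metric on the real line is used to show when the second term is skipped.

```python
from typing import Callable, List, Tuple


def must_visit(
    d: Callable[[float, float], float],
    omega: float,
    eps: float,
    x_s: float,
    r_s: float,
    d_omega_parent: float,   # d(omega, x_p(s)), known from visiting the parent
    d_parent_s: float,       # d(x_p(s), x_s), stored at creation time
) -> bool:
    """Return True if f_s(omega) <= eps, evaluating the cheap term first."""
    if abs(d_omega_parent - d_parent_s) - r_s > eps:
        return False                      # pruned without computing d(omega, x_s)
    return d(omega, x_s) - r_s <= eps     # second term, one distance computation


calls: List[Tuple[float, float]] = []


def recording_metric(a: float, b: float) -> float:
    calls.append((a, b))                  # record every distance computation
    return abs(a - b)


omega, eps, x_parent = 0.0, 3.0, 10.0
d_omega_parent = abs(omega - x_parent)    # assumed already available

far = must_visit(recording_metric, omega, eps, x_s=30.0, r_s=2.0,
                 d_omega_parent=d_omega_parent, d_parent_s=20.0)
calls_after_far = len(calls)

near = must_visit(recording_metric, omega, eps, x_s=4.0, r_s=2.0,
                  d_omega_parent=d_omega_parent, d_parent_s=6.0)
```

In the first call the stored distances alone certify that the child ball cannot meet the query, so no new distance is computed; in the second call one distance computation is needed.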

The QIC-M-tree is a modification of the M-tree where instead of one, three
distances on $\Omega$ are used: the index distance, $d_I$, to construct the index, the
comparison distance, $d_C$, to be used in certification functions, and the query distance,
$d_Q$, according to which the actual result must be computed. The structure of the
QIC-M-tree is the same as the structure of the M-tree except that the value of a
certification function $f_s(\omega)$ is

where $x_s$ is the routing point of node $s$ and $r_s$ is the associated covering radius.
As before, the evaluation is from left to right and is stopped as soon as one of
the expressions exceeds the query radius. It is clear that for consistency of such an
indexing scheme it is necessary and sufficient that the identity maps $(\Omega, d_Q) \to (\Omega, d_I)$
and $(\Omega, d_Q) \to (\Omega, d_C)$ be 1-Lipschitz (Ciaccia and Patella allow for the
scaling factors in the case this is not so). Any $d_Q$ finer than $d_C$ and $d_I$ can be used
as a query distance.

Modifications of the M-tree allowing for processing of complex queries have
been proposed in [40].

5.4 Quasi-metric trees 

Although often mentioned as possible generalisations of metric workloads (e.g.
in [39]), quasi-metric workloads have so far been neglected as far as practical
indexing schemes are concerned. As our biological examples attest (Chapter 3),
quasi-metrics in fact often appear as similarity measures on datasets, even if they
are not recognised as such.

For a nearly symmetric quasi-metric $d$ on a set $\Omega$, where the asymmetry $r(x, y) = |d(x, y) - d(y, x)|$
is small compared to the expected scale of the search, it may be
possible to replace it by a suitable metric without significant loss of performance
by way of what we call a projective reduction of a workload (Subsection 5.6.4).
We find a metric $\rho$ such that $\rho(x, y) \le K d(x, y)$ for all $x, y \in \Omega$, where $K$ is the
smallest positive constant ensuring the above inequality ($K$ is in fact the Lipschitz
constant of the map $(\Omega, d) \to (\Omega, \rho)$) and index the metric space $(\Omega, \rho / K)$. The
QIC-M-tree [39] provides exactly the framework to do so. Obvious choices for $\rho$
are d^ or d^. In the next chapter we perform the analysis of this approach for a set
of peptide fragments.

However, if the quasi-metric in question is highly asymmetric, significant loss
of performance may result because the required Lipschitz constant may be very
large (or even non-existent if $d$ is a $T_0$ quasi-metric) and the metric $\rho$ becomes a
poor approximation to $d$. It is therefore desirable to develop a theory of
indexability for quasi-metric spaces.

We use left 1-Lipschitz functions as certification functions to establish the
direct analogs of Definition 5.3.1 and Theorem 5.3.2 (indeed, the advantage
of our general model is that it allows the incorporation of the quasi-metric case
with very few differences). Recall that a left 1-Lipschitz function $f : X \to \mathbb{R}$ from
a quasi-metric space $(X, d)$ satisfies $f(x) - f(y) \le d(x, y)$ for all $x, y \in X$
(Definition [2AB. 

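Definition 5.4.1. Let $(\Omega, X, \mathcal{Q}^{\mathrm{rng}}_d)$ be a range dissimilarity workload, where $d$ is a
quasi-metric. Let $T$ be a finite rooted tree with root $*$ and let $\mathcal{B} = \{B_t \mid t \in T\}$
be a collection of subsets of $\Omega$ such that

$$X \subseteq \bigcup_{t \in L(T)} B_t \subseteq \Omega \tag{5.4}$$

and for every inner node $t$,

$$\bigcup_{s \in C_t} (B_s \cap X) \subseteq B_t. \tag{5.5}$$

Also, let $\mathcal{F} = \{f_t : \Omega \to \mathbb{R} \mid t \in T \setminus \{*\}\}$ be a collection of certification functions
such that for each $t \in T \setminus \{*\}$,

• $f_t$ is left 1-Lipschitz, and

• For all $\omega \in B_t$, $f_t(\omega) \le 0$.

We call the triple $(T, \mathcal{B}, \mathcal{F})$ a quasi-metric tree for the workload $(\Omega, X, \mathcal{Q}^{\mathrm{rng}}_d)$. Let
$\tilde{\mathcal{B}} = \{B_t \mid t \in L(T)\}$ and $\tilde{\mathcal{F}} = \{F_t : \mathcal{Q}^{\mathrm{rng}}_d \to 2^{C_t} \mid t \in I(T)\}$ where

$$F_t(\mathfrak{B}^L_\varepsilon(\omega)) = \{s \in C_t : f_s(\omega) \le \varepsilon\}. \tag{5.6}$$

The indexing scheme $\mathcal{I}(T, \mathcal{B}, \mathcal{F}) = (T, \tilde{\mathcal{B}}, \tilde{\mathcal{F}})$ is called a quasi-metric tree
indexing scheme. ▲

Theorem 5.4.2. Let $W = (\Omega, X, \mathcal{Q}^{\mathrm{rng}}_d)$ be a quasi-metric similarity workload and
$(T, \mathcal{B}, \mathcal{F})$ a quasi-metric tree. Then the quasi-metric indexing scheme $\mathcal{I}(T, \mathcal{B}, \mathcal{F})$
is a consistent indexing scheme for $W$.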

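Proof. Let $x \in \mathfrak{B}^L_\varepsilon(\omega) \cap X$. By (5.4), there exists a leaf node $t$ such that $x \in B_t$.
Consider the path $s_0 s_1 \ldots s_m$ where $s_0 = *$, $s_m = t$ and $s_i = p(s_{i+1})$, from root
to $t$. By (5.5), for each $i = 1, 2, \ldots, m$, we have $(B_t \cap X) \subseteq (B_{s_i} \cap X) \subseteq B_{s_{i-1}}$
and hence $x \in B_{s_i}$. It follows that $f_{s_i}(x) \le 0$ and, since $f_{s_i}$ is a left 1-Lipschitz
function, we have

$$f_{s_i}(\omega) \le f_{s_i}(\omega) - f_{s_i}(x) \le d(\omega, x) \le \varepsilon.$$

Therefore, $s_i \in F_{s_{i-1}}(\mathfrak{B}^L_\varepsilon(\omega))$ and consistency follows. □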

As with metric trees, certification functions satisfying the above properties 
always exist - they are provided by the distances from points to covering sets. 

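Theorem 5.4.3. Let $(\Omega, X, \mathcal{Q}^{\mathrm{rng}}_d)$ be a range dissimilarity workload, where $d$ is
a quasi-metric, $T$ be a finite rooted tree with root $*$ and $\mathcal{B} = \{B_t \mid t \in T\}$
a collection of subsets of $\Omega$ satisfying (5.4) and (5.5). Then, for each $t \in T$
where $t \neq *$, there exists a left 1-Lipschitz function $f_t$ such that $f_t(\omega) \le 0$ for all
$\omega \in B_t$.

Proof. Put $f_t(\omega) = d(\omega, B_t) = \inf_{x \in B_t} d(\omega, x)$. For any $\omega, \omega' \in \Omega$ and $x \in B_t$ we have
$d(\omega, x) \le d(\omega, \omega') + d(\omega', x)$, so $f_t$ is left 1-Lipschitz (compare Lemma 2.4.5), and
$f_t|_{B_t} = 0$. □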

No general quasi-metric tree indexing scheme has been produced as yet - our
indexing scheme for protein fragments (Chapter 6) is an example of a quasi-metric
tree but is not general. While it is possible to generalise existing indexing schemes
to support quasi-metric queries, the resulting structure is usually more complex.
For example, while the function $d_x : \omega \mapsto d(\omega, x)$ is left 1-Lipschitz (Lemma
2.4.4), $-d_x$ is right 1-Lipschitz but not necessarily left 1-Lipschitz and hence the
generalisation of the vp-tree (Example 5.3.7) certification functions as they are,
just by replacing the metric with a quasi-metric, is not possible. If the distances
from the same vantage point are desired to be used at each node, both the left
and the right distance need to be computed and cutoff values chosen so that the
whole dataset is covered and (if possible - it may not be) that overlap is minimal.
The same is true for the GNAT (Example 5.3.9): certification functions need to be
adjusted to be left 1-Lipschitz and for this it is necessary to compute both left and
right distance to the split points. Hence, additional computation may be necessary
at each node, adversely affecting the performance.

It appears that, out of all our examples of metric indexing schemes, the M-tree
(Example 5.3.10) is most suitable for adaptation for indexing quasi-metric
workloads. The structure of a balanced binary tree should remain while the covering
set at each node $s$ should be the right closed ball $\mathfrak{B}^R_{r_s}(x_s)$ of radius $r_s$ about the
routing object $x_s$. The certification function $f_s$ should be set so that

$$f_s(\omega) = \max\{d(\omega, x_{p(s)}) - d(x_s, x_{p(s)}) - r_s, \; d(\omega, x_s) - r_s\}.$$

The distances $d(x_s, x_{p(s)})$ from routing objects to their parents, as well as the
covering radii $r_s = \max\{d(y, x_s) \mid y \in B_s\}$, can be, as is the case with the M-tree,
computed and stored at creation time.
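
The role of right covering balls and of the certification function $\omega \mapsto d(\omega, x_s) - r_s$ can be illustrated with a small Python sketch. It is not an M-tree: the index is a flat, one-level structure, and the quasi-metric `qd` on the integers (moving up costs 2 per unit, moving down costs 1) as well as the names `build_index` and `left_ball_query` are assumptions made for this example only.

```python
from typing import Dict, List, Tuple


def qd(x: int, y: int) -> int:
    """Illustrative quasi-metric on integers: up costs 2 per unit, down costs 1."""
    return 2 * (y - x) if y >= x else x - y


def build_index(data: List[int], routing: List[int]) -> Tuple[Dict[int, List[int]], Dict[int, int]]:
    """Assign each point y to the routing object x_s minimising d(y, x_s)."""
    bins: Dict[int, List[int]] = {x_s: [] for x_s in routing}
    for y in data:
        nearest = min(routing, key=lambda x_s: qd(y, x_s))
        bins[nearest].append(y)
    # Right covering radius: r_s = max d(y, x_s) over the points y in the bin.
    radii = {x_s: max((qd(y, x_s) for y in ys), default=0) for x_s, ys in bins.items()}
    return bins, radii


def left_ball_query(bins, radii, omega: int, eps: int) -> Tuple[List[int], int]:
    """Return the points y with d(omega, y) <= eps and the number of bins scanned."""
    hits: List[int] = []
    scanned = 0
    for x_s, ys in bins.items():
        if qd(omega, x_s) - radii[x_s] <= eps:      # certification f_s(omega) <= eps
            scanned += 1
            hits.extend(y for y in ys if qd(omega, y) <= eps)
    return sorted(hits), scanned


data = list(range(0, 100, 3))
bins, radii = build_index(data, routing=[10, 40, 70, 95])
hits, scanned = left_ball_query(bins, radii, omega=42, eps=6)
```

The pruning test is safe because $d(\omega, x_s) \le d(\omega, y) + d(y, x_s) \le \varepsilon + r_s$ for every $y$ in the bin that satisfies the query; using the distances in the opposite direction would not give this guarantee.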

The above proposal for turning the M-tree into a quasi-metric tree is, at present, 
only conceptual. Many challenges remain, for example in designing a good split 
policy to be used in the creation algorithm. If an attempt to develop a quasi- 
metric version of M-tree is made, it will be necessary to test it on a variety of 
actual quasi-metric datasets. 

5.5 Valuation Workloads and Indexing Schemes 

Closely related to similarity workloads are what we call valuation workloads. 

Definition 5.5.1. Let $\Omega$ be a set, $X \subseteq \Omega$ a dataset and $f$ a function $\Omega \to \mathbb{R}$. For
$r \in \mathbb{R}_+$ the ($r$-)range valuation query, denoted $Q^{\mathrm{rng}}_f(r)$, is defined by

$$Q^{\mathrm{rng}}_f(r) = \{x \in \Omega : f(x) \le r\}.$$

We denote by $\mathcal{Q}^{\mathrm{rng}}_f$ the set $\{Q^{\mathrm{rng}}_f(r) \mid r \in \mathbb{R}_+\}$ and call a workload $(\Omega, X, \mathcal{Q}^{\mathrm{rng}}_f)$ a
range valuation workload. ▲

Definition 5.5.2. Let $T$ be a rooted tree. A function $f : T \to \mathbb{R}$ is increasing on
$T$ if for all $s \in T$, $t \in C_s$, $f(s) \le f(t)$. ▲

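Definition 5.5.3. Let $(\Omega, X, \mathcal{Q}^{\mathrm{rng}}_f)$ be a range valuation workload and suppose $T$
is a finite rooted tree with root $*$ and $\mathcal{B} = \{B_t \mid t \in L(T)\}$ a collection of subsets
of $\Omega$ such that $X \subseteq \bigcup_{t \in L(T)} B_t \subseteq \Omega$. Suppose $g : T \to \mathbb{R}$ is increasing on $T$ and
for all $t \in L(T)$,

$$g(t) \le \inf_{x \in B_t} f(x).$$

Let $\mathcal{F}_g = \{F_s \mid s \in I(T)\}$ where $F_s(Q^{\mathrm{rng}}_f(r)) = \{t \in C_s : g(t) \le r\}$. The
indexing scheme $\mathcal{I}_g = (T, \mathcal{B}, \mathcal{F}_g)$ is called a valuation indexing scheme. ▲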

Theorem 5.5.4. Every valuation indexing scheme is consistent.

Proof. Let $\mathcal{I}_g = (T, \mathcal{B}, \mathcal{F}_g)$ be a valuation indexing scheme over a range valuation
workload $(\Omega, X, \mathcal{Q}^{\mathrm{rng}}_f)$ and $Q \in \mathcal{Q}^{\mathrm{rng}}_f$. Suppose $x \in Q \cap X$, that is, $f(x) \le r$ for
some $r \in \mathbb{R}_+$. Since $\mathcal{B}$ is a cover of $X$, there exists a leaf node $t$ such that $x \in B_t$.
Consider the path $s_0 s_1 \ldots s_m$ where $s_0 = *$, $s_m = t$ and $s_i = p(s_{i+1})$, from root
to $t$. Since $g$ is increasing on $T$, we have $g(s_0) \le g(s_1) \le \ldots \le g(t) \le f(x) \le r$
and therefore $s_i \in F_{s_{i-1}}(Q)$ for each $i = 1, 2, \ldots, m$. □
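
A valuation indexing scheme can be sketched in Python as below. The construction is one of many possible ones and is given only as an illustration: items are sorted by their valuation, split recursively, and each node stores a lower bound `g` on the valuations beneath it, so that `g` is increasing along every path. The dictionary-based node layout and the function names are hypothetical.

```python
from typing import Callable, Dict, List


def build_valuation_tree(items: List[str], f: Callable[[str], float], leaf_size: int = 2) -> Dict:
    """Build a tree over a non-empty list; node['g'] bounds f from below on the subtree."""
    items = sorted(items, key=f)
    if len(items) <= leaf_size:
        return {"g": min(map(f, items)), "children": [], "bin": items}
    mid = len(items) // 2
    kids = [build_valuation_tree(items[:mid], f, leaf_size),
            build_valuation_tree(items[mid:], f, leaf_size)]
    # Taking the minimum over the children makes g increasing on the tree.
    return {"g": min(k["g"] for k in kids), "children": kids, "bin": []}


def valuation_query(node: Dict, f: Callable[[str], float], r: float) -> List[str]:
    """Return all stored items x with f(x) <= r."""
    if not node["children"]:
        return [x for x in node["bin"] if f(x) <= r]
    hits: List[str] = []
    for child in node["children"]:
        if child["g"] <= r:                  # decision function: keep t with g(t) <= r
            hits += valuation_query(child, f, r)
    return hits


words = ["a", "tree", "of", "valuation", "bins", "index"]
tree = build_valuation_tree(words, len)
short_words = sorted(valuation_query(tree, len, 2))
```

A branch is skipped as soon as its lower bound exceeds the query radius, which is safe because every item below it has a valuation at least as large as that bound.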



Valuation workloads are perhaps not very interesting on their own but it should
be noted that every workload can be decomposed as a union of valuation
workloads having the same underlying domain and dataset (Subsection 5.6.2). If a tree
structure is present, Theorem 5.5.4 ensures that a consistent indexing scheme
can be constructed.



5.6 New indexing schemes from old 

Here we formulate in an abstract setting some constructions commonly used to
generate new access methods from the existing ones. Our general approach makes
these constructions amenable to analysis by means of theoretical computer
science.

5.6.1 Disjoint sums

Any collection of access methods for workloads $W_1, W_2, \ldots, W_n$ leads to an
access method for the disjoint sum workload $\bigsqcup_{i=1}^n W_i$: to answer a query
$Q = \bigsqcup_{i=1}^n Q_i$, it suffices to answer each query $Q_i$, $i = 1, 2, \ldots, n$, and then merge
the outputs.

In particular, if each $W_i$ is equipped with an indexing scheme, $\mathcal{I}_i = (T_i, \mathcal{B}_i, \mathcal{F}_i)$,
then a new indexing scheme for $\bigsqcup_{i=1}^n W_i$, denoted $\mathcal{I} = \bigsqcup_{i=1}^n \mathcal{I}_i$, is constructed as
follows: the tree $T$ contains all $T_i$'s as branches beginning at the root node, while
the families of bins and of decision functions for $\mathcal{I}$ are unions of the respective
collections for all $\mathcal{I}_i$, $i = 1, 2, \ldots, n$.

This construction is often used coupled with an equivalence relation which
partitions the domain, instance and each of the queries into smaller spaces,
perhaps with a better structure, which are then indexed separately ('subindexed'). A
good illustration is our indexing scheme for weighted quasi-metric spaces.

Example 5.6.1. Recall that a weighted quasi-metric (Section 2.6) over a domain
$\Omega$ is a quasi-metric $d$ such that for some weight function $w$ and for all $x, y \in \Omega$,

$$d(x, y) + w(x) = d(y, x) + w(y).$$

The following Proposition shows that any weighted quasi-metric similarity
workload $W = (\Omega, X, \mathcal{Q}^{\mathrm{rng}}_d)$ can be indexed using the decomposition into a disjoint
union of metric spaces or fibres, one for each value that the weight function $w$
takes.

Proposition 5.6.2. Let $(\Omega, d, w)$ be a weighted quasi-metric space and denote by
$G_z$ the set $\{x \in \Omega : w(x) = z\}$, and by $\mathfrak{B}^\rho_\varepsilon(x)$ the closed ball of radius $\varepsilon$
centred at $x \in \Omega$ with respect to the metric $\rho$ where for each $x, y \in \Omega$,
$\rho(x, y) = \frac{1}{2}(d(x, y) + d(y, x))$. Then

(ii) $\mathfrak{B}^L_\varepsilon(x) = \bigcup_{z \in w(\Omega)} \mathfrak{B}^L_\varepsilon(x)|_{G_z}$ for all $x \in \Omega$, $\varepsilon > 0$, and

(iii) $\mathfrak{B}^L_\varepsilon(x)|_{G_z} = \mathfrak{B}^\rho_{\varepsilon - \frac{1}{2}(z - w(x))}(x)|_{G_z}$ for all $x \in \Omega$, $\varepsilon > 0$.

Proof. The first two statements are obvious while the third claim follows directly
from

$$\rho(x, y) = \frac{1}{2}(d(x, y) + d(y, x)) = d(x, y) + \frac{1}{2}(w(x) - w(y)).$$

□

Therefore, provided that w takes few values on the dataset (otherwise close 
fibres need to be merged), it is possible to index into W by indexing data points 
for each fibre using one of the existing indexing schemes for metric spaces and 
then collecting the results. We call this scheme a FMTree (Fibre Metric Tree). 
Some of our attempts to use this scheme to index into datasets of short protein 
fragments are described in the next chapter. 
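
The fibre decomposition can be checked on a toy example. In the Python sketch below the weighted quasi-metric on pairs (position, height) and all function names are assumptions chosen for illustration: climbing costs the height difference, descending is free, and the weight of a point is its height. Each fibre is scanned linearly, standing in for whichever metric index would be used in practice.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[int, int]  # (position a, height h)


def d(x: Point, y: Point) -> int:
    """Illustrative weighted quasi-metric: horizontal distance plus the climb from x to y."""
    return abs(x[0] - y[0]) + max(y[1] - x[1], 0)


def w(x: Point) -> int:
    """Weight function satisfying d(x, y) + w(x) = d(y, x) + w(y)."""
    return x[1]


def rho(x: Point, y: Point) -> float:
    """Symmetrisation rho = (d(x, y) + d(y, x)) / 2."""
    return (d(x, y) + d(y, x)) / 2


def build_fibres(data: List[Point]) -> Dict[int, List[Point]]:
    fibres: Dict[int, List[Point]] = defaultdict(list)
    for y in data:
        fibres[w(y)].append(y)               # one fibre G_z per weight value z
    return fibres


def fibre_query(fibres: Dict[int, List[Point]], x: Point, eps: float) -> List[Point]:
    """Answer the left-ball query d(x, y) <= eps by a rho-ball query in each fibre."""
    hits: List[Point] = []
    for z, points in fibres.items():
        radius = eps - (z - w(x)) / 2        # adjusted radius for the fibre G_z
        if radius < 0:
            continue                         # the fibre cannot contain any answer
        hits.extend(y for y in points if rho(x, y) <= radius)
    return sorted(hits)


data = [(a, h) for a in range(6) for h in range(4)]
fibres = build_fibres(data)
answer = fibre_query(fibres, (2, 1), 2)
```

Fibres whose adjusted radius is negative are skipped outright, which is where the saving comes from when the weight function takes few values.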

5.6.2 Query partitions 

A similar technique can be used where the set of queries over some domain is
partitioned and a separate indexing scheme exists for each partition.

Let $\Omega$ be a domain, $X \subseteq \Omega$ a dataset and $\mathcal{Q}_i$, $i = 1, 2, \ldots, n$ a pairwise disjoint
family of queries over $\Omega$. A collection of access methods for the workloads
$W_i = (\Omega, X, \mathcal{Q}_i)$ leads to an access method for the workload $W = (\Omega, X, \bigsqcup_{i=1}^n \mathcal{Q}_i)$: to
answer a query $Q \in \bigsqcup_{i=1}^n \mathcal{Q}_i$, find $i$ such that $Q \in \mathcal{Q}_i$ and answer it using the
access method for the workload $W_i$.

As in the disjoint sum case, if each $W_i$ is equipped with a consistent indexing
scheme, $\mathcal{I}_i = (T_i, \mathcal{B}_i, \mathcal{F}_i)$, then a new consistent indexing scheme for $W$, denoted
$\mathcal{I}$, is constructed as follows: the tree $T$ contains all $T_i$'s as branches beginning at
the root node, while the families of bins and of decision functions for $\mathcal{I}$ contain
the unions of the respective collections for all $\mathcal{I}_i$, $i = 1, 2, \ldots, n$. The decision
function at the root for each query $Q \in \mathcal{Q}_i$ returns the set consisting of the branch
$T_i$. We call such an indexing scheme a query partitioning indexing scheme.

A query partitioning indexing scheme can be considered to be highly
redundant (see Subsection 5.7.1 for the precise definition of redundancy of indexing
schemes) since each major branch contains the bins covering the whole dataset
which, in many cases, may occupy considerable space. However, in some cases
it may be possible for such an indexing scheme to occupy the space much more
efficiently. Our indexing scheme for protein fragment workloads, called FSindex,
is a good example of the query partitioning approach with no redundancy - each
data point is stored only once.

5.6.3 Inductive reduction 

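Let $W_i = (\Omega_i, X_i, \mathcal{Q}_i)$, $i = 1, 2$ be two workloads. An inductive reduction of $W_1$
to $W_2$ is a pair of mappings $i : \Omega_2 \to \Omega_1$, $i^{\leftarrow} : \mathcal{Q}_1 \to \mathcal{Q}_2$, such that

• $i(X_2) \supseteq X_1$,

• for each $Q \in \mathcal{Q}_1$, $i^{-1}(Q) \subseteq i^{\leftarrow}(Q)$.

Notation: $W_2 \stackrel{i}{\Rightarrow} W_1$.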

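An access method for $W_2$ leads to an access method for $W_1$, where a query
$Q \in \mathcal{Q}_1$ is answered as in Algorithm 5.6.1.

Algorithm 5.6.1: $W_1$.RetrieveQuery($Q$)
  comment: $W_2 = (\Omega_2, X_2, \mathcal{Q}_2) \overset{i}{\Rightarrow} W_1 = (\Omega_1, X_1, \mathcal{Q}_1)$, $Q \in \mathcal{Q}_1$
  $R_1 \leftarrow \emptyset$
  $R_2 \leftarrow W_2$.RetrieveQuery($i^{\leftarrow}(Q)$)
  comment: $R_2 = X_2 \cap i^{\leftarrow}(Q)$
  for each $y \in R_2$
    do if $i(y) \in Q$
      then $R_1 \leftarrow R_1 \cup \{i(y)\}$
  return ($R_1$)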
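A minimal Python sketch of the retrieval in Algorithm 5.6.1 follows. It assumes finite domains with queries modelled as `frozenset`s, and it assumes that the filtering step inside the loop is the membership test $i(y) \in Q$; the toy instance (labelled copies of integers, with `forget_label` playing the role of $i$) is purely hypothetical.

```python
def retrieve_via_inductive_reduction(query, i, i_arrow, retrieve_w2):
    """Answer `query` (a set over domain 1) through an access method for W2."""
    candidates = retrieve_w2(i_arrow(query))      # R2 = X2 ∩ i←(Q)
    # Assumed filter: keep only images that really belong to the original query.
    return {i(y) for y in candidates if i(y) in query}


# Toy instance: domain 2 holds labelled copies (label, value) of the integers
# of domain 1, and the map i simply forgets the label.
labels = ("a", "b")
x2 = {("a", 0), ("a", 1), ("b", 3), ("b", 7)}    # dataset of the bigger workload


def forget_label(point):
    return point[1]


def lift_query(query):
    # i←(Q): here the full preimage of Q, which trivially contains i^{-1}(Q).
    return frozenset((label, value) for label in labels for value in query)


def scan_w2(query2):
    # Stand-in access method for W2: linear scan of x2.
    return {y for y in x2 if y in query2}
```

Calling `retrieve_via_inductive_reduction(frozenset({1, 3, 4}), forget_label, lift_query, scan_w2)` returns the data points of the smaller workload lying in the query.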

If $\mathcal{I}_2 = (T_2, \mathcal{B}_2, \mathcal{F}_2)$ is a consistent indexing scheme
for $W_2$, then a consistent indexing scheme $\mathcal{I}_1 = i_*(\mathcal{I}_2)$ for
$W_1$ is constructed by taking $T_1 = T_2$, $B_t^{(1)} = i(B_t^{(2)})$ and
$F_t^{(1)}(Q) = F_t^{(2)}(i^{\leftarrow}(Q))$ (the upper index $1, 2$ refers to the two
workloads). The bigger workload used for inductive reduction usually carries a
structure that supports an efficient access method.

Example 5.6.3. Let $\Gamma$ be a finite graph of bounded degree, $k$. Associate to it a
graph workload, $W_\Gamma$, which is an inner workload with $X = V_\Gamma$, the set of
vertices, and $\mathcal{Q} = \{Q_{k\mathrm{NN}}(v, V_\Gamma) \mid v \in V_\Gamma\}$, the
set of $k$NN queries where $d$ is the shortest path metric on $\Gamma$.

A linear forest is a graph that is a disjoint union of paths. The linear arboricity,
$la(\Gamma)$, of a graph $\Gamma$ is the smallest number of linear forests whose union
is $\Gamma$. This number is, in fact, fairly small: it does not exceed
$\lceil 3D/5 \rceil$, where $D$ is the degree of $\Gamma$ [82]. The Linear Arboricity
Conjecture [3], which states that $la(\Gamma) \le \lceil (D+1)/2 \rceil$, was found to
hold for numerous cases [5]. Results for $k$-linear arboricity, the minimum number of
forests whose connected components are paths of length at most $k$, are also
available [125]. This concept leads to an indexing scheme for the graph workload
$W_\Gamma$, as follows.

Let $F_i$, $i = 1, \ldots, la(\Gamma)$ be linear forests whose union is $\Gamma$. Denote
$F = \bigsqcup_{i=1}^{la(\Gamma)} F_i$ and let $\phi : F \to \Gamma$ be a surjective map
preserving the adjacency relation. Every linear forest can be ordered, and indexed
into as in Example 5.2.19. At the next step, index into the disjoint sum $F$ as in
Subsection 5.6.1. Finally, index into $\Gamma$ using the inductive reduction
$\phi : F \to \Gamma$. This indexing scheme outputs nearest neighbours of any vertex of
$\Gamma$ in time $O(D \log n)$, requiring storage space $O(n)$, where $n$ is the number
of vertices in $\Gamma$.

5.6.4 Projective reduction 

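Let $W_i = (\Omega_i, X_i, \mathcal{Q}_i)$, $i = 1, 2$ be two workloads. A projective
reduction of $W_1$ to $W_2$ is a pair of mappings $r : \Omega_1 \to \Omega_2$,
$r^{\rightarrow} : \mathcal{Q}_1 \to \mathcal{Q}_2$, such that

• $r(X_1) \subseteq X_2$,

• for each $Q \in \mathcal{Q}_1$, $r(Q) \subseteq r^{\rightarrow}(Q)$.

Notation: $W_1 \overset{r}{\Rightarrow} W_2$.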






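An access method for $W_2$ leads to an access method for $W_1$, where a query
$Q \in \mathcal{Q}_1$ is answered as follows:

Algorithm 5.6.2: $W_1$.RetrieveQuery($Q$)
  comment: $W_1 = (\Omega_1, X_1, \mathcal{Q}_1) \overset{r}{\Rightarrow} W_2 = (\Omega_2, X_2, \mathcal{Q}_2)$, $Q \in \mathcal{Q}_1$
  $R_1 \leftarrow \emptyset$
  $R_2 \leftarrow W_2$.RetrieveQuery($r^{\rightarrow}(Q)$)
  comment: $R_2 = X_2 \cap r^{\rightarrow}(Q)$
  for each $y \in R_2$
    do for each $x \in r^{-1}(y)$
      do if $x \in Q$
        then $R_1 \leftarrow R_1 \cup \{x\}$
  return ($R_1$)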
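Algorithm 5.6.2 can be sketched in Python as below. The sketch assumes finite domains, queries as `frozenset`s, fibres $r^{-1}(y)$ restricted to the dataset and stored in a dictionary, and a final membership test $x \in Q$; the bucketing map `bucket` (integer division by 10) and all other names are hypothetical choices made for the illustration.

```python
from collections import defaultdict


def retrieve_via_projective_reduction(query, r_arrow, fibre, retrieve_w2):
    """Answer `query` over domain 1 by querying the reduced workload W2."""
    result = set()
    for y in retrieve_w2(r_arrow(query)):         # R2 = X2 ∩ r→(Q)
        for x in fibre(y):                        # data points x with r(x) = y
            if x in query:                        # discard false positives
                result.add(x)
    return result


# Toy instance: integers are projected onto buckets of width 10.
x1 = {3, 8, 12, 17, 25, 41}


def bucket(x):
    return x // 10


fibres = defaultdict(set)
for point in x1:
    fibres[bucket(point)].add(point)
x2 = set(fibres)                                  # the occupied buckets


def project_query(query):
    # r→(Q): the set of buckets met by Q, which contains r(Q).
    return frozenset(bucket(x) for x in query)


def scan_w2(query2):
    return x2 & query2


def fibre(y):
    return fibres[y]
```

The inner filter is what keeps the scheme consistent: the reduced workload may return buckets that only partly overlap the query.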

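Let $\mathcal{I}_2 = (T_2, \mathcal{B}_2, \mathcal{F}_2)$ be a consistent indexing scheme
for $W_2$. The projective reduction $W_1 \overset{r}{\Rightarrow} W_2$ canonically
determines an indexing scheme $\mathcal{I}_1 = r^*(\mathcal{I}_2)$ as follows:
$T_1 = T_2$, $B_t^{(1)} = r^{-1}(B_t^{(2)})$, and
$F_t^{(1)}(Q) = F_t^{(2)}(r^{\rightarrow}(Q))$.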

Example 5.6.4. The linear scan of a dataset is a projective reduction to the trivial
workload.

If $W = (\Omega, X, \mathcal{Q})$ is a workload and $\Omega'$ is a domain, then every
mapping $r : \Omega \to \Omega'$ determines the direct image workload,
$r_*(W) = (\Omega', r(X), r(\mathcal{Q}))$, where $r(X)$ is the image of $X$ under $r$
and $r(\mathcal{Q})$ is the family of all queries $r(Q)$, $Q \in \mathcal{Q}$.

Example 5.6.5. Let $\mathcal{B}$ be a finite collection of blocks partitioning $\Omega$.
Define the discrete workload $(\mathcal{B}, \mathcal{B}, 2^{\mathcal{B}})$, and define
the reduction by mapping each $\omega \in \Omega$ to the corresponding block and
defining each $r^{\rightarrow}(Q)$ as the union of all blocks that meet $Q$. The
corresponding reduction forms a basic building block of many indexing schemes [36].

Example 5.6.6. Let $W_i$, $i = 1, 2$ be two metric range similarity workloads, that
is, their query sets are generated by metrics $d_i$, $i = 1, 2$. In order for a mapping
$f : \Omega_1 \to \Omega_2$ with the property $f(X_1) \subseteq X_2$ to determine a
projective reduction $f : W_1 \Rightarrow W_2$, it is necessary and sufficient that $f$
be 1-Lipschitz: indeed, in this case every ball $\mathfrak{B}_\varepsilon(x)$ in
$\Omega_1$ will be mapped inside of the ball $\mathfrak{B}_\varepsilon(f(x))$ in
$\Omega_2$.

Example 5.6.7. More specifically, the following technique (described in detail
in [36]) is often used to map metric spaces into $\ell_\infty$ in order to use vector
space indexing schemes such as the R-tree (Example 5.3.4).

Let $(\Omega, d)$ be a metric space and choose $n$ 1-Lipschitz functions
$f_1, f_2, \ldots, f_n : \Omega \to \mathbb{R}$. It is easy to see that the map
$\omega \mapsto (f_1(\omega), f_2(\omega), \ldots, f_n(\omega))$ is a 1-Lipschitz map
$\Omega \to \ell_\infty^n$ and thus induces a projective reduction to the vector space
workload. The most common way of choosing the required 1-Lipschitz functions is to
select $n$ pivots $x_1, x_2, \ldots, x_n$ and set $f_i(\omega) = d(x_i, \omega)$.
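The pivot construction can be illustrated with a short Python sketch. It assumes points in the plane with the Euclidean metric (via the standard `math.dist`), and the names `pivot_embedding`, `chebyshev` and `range_query_with_pivots` are hypothetical; a real scheme would index the embedded vectors (for instance with an R-tree) instead of scanning them.

```python
import math


def chebyshev(u, v):
    """The l-infinity distance between two vectors of equal length."""
    return max(abs(a - b) for a, b in zip(u, v))


def pivot_embedding(pivots, dist):
    """Return the map w -> (d(x_1, w), ..., d(x_n, w)) for the given pivots."""
    return lambda w: tuple(dist(p, w) for p in pivots)


def range_query_with_pivots(dataset, centre, radius, pivots, dist):
    embed = pivot_embedding(pivots, dist)
    image_of_centre = embed(centre)
    # Filter in l-infinity: since the embedding is 1-Lipschitz, no true hit is lost.
    candidates = [x for x in dataset
                  if chebyshev(embed(x), image_of_centre) <= radius]
    # Refine with the original metric to remove false positives.
    return [x for x in candidates if dist(centre, x) <= radius]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
pivots = [(0.0, 0.0), (5.0, 5.0)]
```

The filter step never discards a true hit because $|d(x_i, \omega) - d(x_i, x)| \le d(\omega, x)$ for every pivot $x_i$, which is exactly the 1-Lipschitz property used in the example.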

Example 5.6.8. Pre-filtering is an often used instance of projective reduction.
In the context of metric similarity workloads, this normally denotes a procedure
whereby a metric $\rho$ is replaced with a coarser distance $d$ which is
computationally cheaper. While the distance $d$ need not be a metric (in fact it need
not even satisfy the triangle inequality), it is necessary and sufficient that
$d(x, y) \le \rho(x, y)$ for all $x, y \in \Omega$ for the identity map to induce a
projective reduction. The QIC-M-Tree [39] provides an example of this approach.
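As a hedged illustration of pre-filtering (not the QIC-M-Tree itself), the sketch below uses strings, with the edit distance in the role of the expensive metric $\rho$ and the difference of lengths as a cheap distance $d$ that never exceeds it. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions): the expensive metric."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def length_gap(a, b):
    """Cheap lower bound: the edit distance is at least the difference of lengths."""
    return abs(len(a) - len(b))


def prefiltered_range_query(dataset, centre, radius, cheap, expensive):
    """Return the hits and the number of expensive distance evaluations made."""
    survivors = [x for x in dataset if cheap(centre, x) <= radius]
    hits = [x for x in survivors if expensive(centre, x) <= radius]
    return hits, len(survivors)


words = ["cat", "cart", "carpet", "dog", "at"]
```

Because `length_gap` is dominated by `levenshtein`, every word within the radius survives the filter, and the expensive distance is evaluated only on the survivors.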

Example 5.6.9. A frequently used tool for dimensionality reduction of datasets is
the famous Johnson-Lindenstrauss lemma [102]. Let $\Omega = \mathbb{R}^N$ be a
Euclidean space of high dimension, and let $X \subset \mathbb{R}^N$ be a dataset with
$n$ points. If $\varepsilon > 0$ and $p$ is a randomly chosen orthogonal projection of
$\mathbb{R}^N$ onto a Euclidean subspace of dimension $k = O(\log n)/\varepsilon^2$,
then with overwhelming probability the mapping $\sqrt{N/k}\, p$ does not distort
distances within $X$ by more than the factor of $1 \pm \varepsilon$. More results of
the same type, for embedding $n$-point datasets into lower dimensional linear (not
necessarily Euclidean) spaces, were obtained in [127].

Such techniques do not extend with the same distortion to the entire domain
$\Omega = \mathbb{R}^N$, meaning that they can only be applied to construct consistent
indexing schemes for the inner workload $(X, X, \mathcal{Q})$, and not the outer
workload $(\Omega, X, \mathcal{Q})$.

5.7 Performance and Geometry

In the preceding sections we were mostly concerned with the abstract foundations 
of indexing and similarity search and therefore have mostly ignored the issue of 
the performance. This is of course the key question: the rationale for indexing is 
exactly that it is supposed to speed up searches. Our definitions of similarity work- 
load and indexing scheme clearly point towards a geometric setting for answering 
the questions about the performance. Here we attempt to examine some factors 
concerning the performance of indexing schemes, albeit at a purely conceptual 
level. This is indeed the only possible way without either a concrete dataset, or 
very detailed assumptions about the workload. 

Our main result is yet another way of describing the Curse of Dimensionality,
the general observation that indexing schemes for high dimensional spaces
perform very badly - often an optimised sequential scan performs better. The
framework we use was first introduced in [154]: a metric similarity workload is
identified with an mm-space where the measure reflects the distribution of query
points. We use the techniques from [154] to derive lower bounds on the number
of blocks that must be processed in order to answer a range query of radius
$\varepsilon$.

5.7.1 Cost model for indexing schemes 

In estimating the performance of indexing schemes, as with other algorithms and
data structures in computer science, we are primarily interested in two quantities:
the space occupied by the indexing structure and the time required to process the
query. As always there is a tradeoff between the two. For example, for an $n$-point
dataset, sequential scan (Example 5.2.17) takes $O(n)$ time with $\Omega(n)$ space (the
space necessary to store all data points) while, if the workload is inner, hashing
(Example 5.2.18) takes $O(1)$ time with $\Omega(|\mathcal{Q}|)$ space. Therefore, an
investigation of performance of an indexing scheme has to take into account both the
space and the query time complexity as well as the time required to build or update
the structures.

The space complexity is of great importance in practice, especially with large
datasets - often we are constrained to take no more than $O(n)$ space. However, we
shall concentrate mostly on the query time complexity since the space complexity
can be easily estimated directly. At this stage we deliberately ignore the index
creation complexity - we always assume that an index is already constructed, that
is, that all of $(T, \mathcal{B}, \mathcal{F})$ are defined.

The general goal of indexing is to produce access methods that have time
complexity sublinear in the size of the dataset. Often, the authors of indexing
schemes claim to achieve $O(\log n)$ time (see for example a summary of space and time
complexities of existing metric indexing schemes in [36]), but this claim usually
only holds for 'small' queries. Nevertheless, in practice, even a constant reduction
of the number of data points to be scanned, say to 10%, is worth pursuing if it is
not accompanied by too large an overhead.

General time complexity 

In most general terms, the time required to process a query $Q \in \mathcal{Q}$ using a
consistent indexing scheme $\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$ on a workload
$W = (\Omega, X, \mathcal{Q})$ is given by

\[ \mathrm{time}(Q) = \mathrm{time}_T(Q) + \mathrm{time}_{\mathcal{B}}(Q) + \mathrm{time}_{\mathcal{F}}(Q) \tag{5.7} \]

where $\mathrm{time}(Q)$ is the total time required to process query $Q$,
$\mathrm{time}_T(Q)$ is the time associated with traversing the nodes of $T$,
$\mathrm{time}_{\mathcal{F}}(Q)$ is the total time spent evaluating decision functions
at all visited inner nodes of $T$ and $\mathrm{time}_{\mathcal{B}}(Q)$ is the total time
spent scanning the sets $B \cap X$ for each block $B \in \mathcal{B}$ associated with
the leaf nodes visited.

The $\mathrm{time}_T(Q)$ is mostly associated with the data structures required for
tree traversal. It includes the cost of retrieving the nodes from secondary memory
(I/O costs) if it is used as well as the cost of any additional data structures used.
For example, some algorithms for kNN similarity search [93], which are described in
more detail in the context of our indexing scheme for peptide fragments in Chapter
6, make use of a priority queue for tree traversal. Under some circumstances, such
as a large number of nearest neighbours required, both the space and the time
costs of the priority queue are not negligible. On the other hand, if the whole
structure is stored in primary memory and no expensive data structures are used,
the $\mathrm{time}_T(Q)$ can be very small compared with the other two times and is
often ignored [36].

Equation 5.7 can be elaborated in the following way: let $S(Q)$ be the set of nodes of
$T$ visited in order to retrieve a query $Q$. Denote by $I(Q)$ the set
$I(T) \cap S(Q)$ and by $L(Q)$ the set $L(T) \cap S(Q)$. Then we have

\[ \mathrm{time}(Q) = \mathrm{time}_T(Q) + \sum_{t \in L(Q)} \sum_{x \in B_t \cap X} \mathrm{time}(Q, x) + \sum_{t \in I(Q)} \mathrm{time}(Q, F_t) \tag{5.8} \]

where $\mathrm{time}(Q, x)$ is the time required to check if $x \in Q$ and
$\mathrm{time}(Q, F_t)$ is the time required to evaluate $F_t(Q)$.
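The decomposition in Equation 5.8 can be turned into a small bookkeeping routine. The sketch below is an illustration only: it assumes that the visited leaf and inner nodes are already known, that blocks are stored in a dictionary keyed by leaf node, and that the per-point and per-node costs are supplied as functions. All names are hypothetical.

```python
def query_time(leaf_nodes, inner_nodes, blocks, dataset,
               traversal_time, point_time, decision_time):
    """Total cost: traversal + scanning of visited blocks + decision functions."""
    # Double sum over visited leaves t and data points x in B_t ∩ X.
    scanning = sum(point_time(x)
                   for t in leaf_nodes
                   for x in blocks[t] & dataset)
    # Sum over visited inner nodes t of the cost of evaluating F_t(Q).
    deciding = sum(decision_time(t) for t in inner_nodes)
    return traversal_time + scanning + deciding


blocks = {"leaf1": {1, 2, 3}, "leaf2": {4, 5}, "leaf3": {6}}
dataset = {1, 2, 3, 4, 5, 6}

total = query_time(leaf_nodes=["leaf1", "leaf2"],
                   inner_nodes=["root", "node1"],
                   blocks=blocks,
                   dataset=dataset,
                   traversal_time=0.5,
                   point_time=lambda x: 2.0,      # e.g. one distance evaluation
                   decision_time=lambda t: 1.0)
```

With constant unit costs the scanning term reduces to a count of the data points stored in the visited blocks, which is the quantity most of the cost models discussed below concentrate on.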

Most frequently, we are not interested in the performance for a single query
but in either the average or the worst case performance. However, in order to
measure the average search time it is necessary to have a probability distribution
on the set of queries $\mathcal{Q}$. We shall return to this theme in Subsection 5.7.2.

Example 5.7.1. In [36] the general cost of a (range) query for a metric indexing
scheme is measured by the number of distances evaluated. In this case the
$\mathrm{time}(Q, x)$ is the time taken to evaluate the distance from the query centre
$\omega$ to $x$ and it is assumed that each evaluation of a certification function is
based on one or more distance evaluations. The I/O costs ($\mathrm{time}_T(Q)$) are
ignored and it is assumed that other costs of the indexing structure are an order of
magnitude less than costs of distance evaluations.

Example 5.7.2. A more elaborate cost model, consistent with Equations 5.7
and 5.8, was proposed by Ciaccia and Patella [39] in the context of the
QIC-M-tree (Example 5.3.10). Since the QIC-M-tree is a paged structure, the I/O
costs are explicitly included. The $\mathrm{time}_{\mathcal{B}}(Q)$ depends only upon the
comparison distance $d_C$ (it is exactly the time to evaluate query distances to all
points retrieved from the leaf nodes) while the $\mathrm{time}_{\mathcal{F}}(Q)$ depends
on the index distance $d_I$ as well as $d_C$. The authors note that the performance
does not depend directly on the query distance $d_Q$, which is approximated by $d_I$
and $d_C$, give formulae for the average costs in terms of the distributions of $d_I$
and $d_C$, and develop ways to choose comparison distances so as to optimise
performance.

Redundancy and Access Overhead 



In their 1997 paper [87] and its followup with additional coauthors Miranker and
Samoladas [86], Hellerstein, Koutsoupias and Papadimitriou proposed two measures
of performance of indexing schemes, redundancy and access overhead, and
showed that there is a tradeoff between the two. We present the adaptations of
their concepts to our model.

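Definition 5.7.3. Let $W = (\Omega, X, \mathcal{Q})$ be a workload and
$\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$ an indexing scheme. The redundancy $r(x)$
of $x \in X$ is the number of blocks that contain $x$, that is,

\[ r(x) = |\{B \in \mathcal{B} : x \in B\}|. \]

The average redundancy $r(\mathcal{I})$ of the indexing scheme $\mathcal{I}$ is the
average of $r(x)$ over all data points:

\[ r(\mathcal{I}) = \frac{1}{|X|} \sum_{x \in X} r(x). \]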

Definition 5.7.4. Let $W = (\Omega, X, \mathcal{Q})$ be a workload and
$\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$ an indexing scheme. For a query
$Q \in \mathcal{Q}$ denote, as before, by $L(Q)$ the set of leaf nodes visited to answer
$Q$. The access overhead $A(Q)$ of query $Q$ is defined as

\[ A(Q) = \frac{\sum_{t \in L(Q)} |B_t \cap X|}{\max\{|Q \cap X|, 1\}}. \]

The (worst case) access overhead $A(\mathcal{I})$ for the indexing scheme $\mathcal{I}$ is

\[ A(\mathcal{I}) = \sup\{A(Q) \mid Q \in \mathcal{Q}\}. \]

If furthermore all blocks $B_t \in \mathcal{B}$ contain $m$ data points, we define the
block access overhead $A_{\mathcal{B}}(Q)$ of query $Q$ by

\[ A_{\mathcal{B}}(Q) = \frac{|L(Q)|}{\max\{\lceil |Q \cap X| / m \rceil, 1\}}, \]

and of the indexing scheme $\mathcal{I}$ by
$A_{\mathcal{B}}(\mathcal{I}) = \sup\{A_{\mathcal{B}}(Q) \mid Q \in \mathcal{Q}\}$.

If $\mu$ is a probability measure on $\mathcal{Q}$, we define the average access overhead
$A_\mu(\mathcal{I})$ for the indexing scheme $\mathcal{I}$ by

\[ A_\mu(\mathcal{I}) = \int_{\mathcal{Q}} A(Q) \, d\mu(Q), \]

and the average block access overhead analogously, by integrating
$A_{\mathcal{B}}(Q)$ with respect to $\mu$. ▲

The access overhead $A(Q)$ measures the cost of answering the query $Q$ using
the set of blocks $\mathcal{B}$ (that is, the $\mathrm{time}_{\mathcal{B}}$ - the costs
associated with $T$ and $\mathcal{F}$ are ignored) normalised by the ideal cost and
hence takes values in $[1, \infty)$. The block access overhead measures the same cost
in terms of block accesses and corresponds to the original definition of access
overhead in [87]. Our new definition was chosen in order not to depend on block size,
which in some indexing schemes may vary considerably, and to allow for empty queries,
which do take time to process.

The main result of [86] is the Redundancy Theorem which, in a workload independent
way, gives a lower bound for the redundancy in terms of the block size and access
overhead.

Theorem 5.7.5 ([86]). Let $W = (\Omega, X, \mathcal{Q})$ be a workload and
$\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$ an indexing scheme such that all blocks
contain $m$ data points and $A_{\mathcal{B}}(\mathcal{I}) \le \sqrt{m}/4$. Let
$Q_1, Q_2, \ldots, Q_M$ be queries such that for every $i = 1, 2, \ldots, M$:

(i) $|Q_i \cap X| \ge m/2$, and

(ii) $|Q_i \cap Q_j \cap X| \le m / (16 A_{\mathcal{B}}(\mathcal{I})^2)$ for all
$j = 1, 2, \ldots, M$, $j \ne i$.

Then the average redundancy is bounded by

\[ r(\mathcal{I}) \ge \frac{1}{12 |X|} \sum_{i=1}^{M} |Q_i \cap X|. \]

In most applications, due to space constraints, the redundancy of each data point
$x$ is set to 1, that is, there is only one block containing $x$. Theorem 5.7.5
then gives a lower bound for the block access overhead provided the queries do
not pairwise intersect to too great an extent. If a better block access overhead is
desired while the block size stays the same, it is necessary to increase the (average)
redundancy.
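To make the two measures concrete, here is a small Python sketch that computes the redundancy of a point, the average redundancy of a family of blocks, and a block access overhead. It is an illustration under assumptions: blocks and queries are finite sets, the list of visited blocks is given, and the block access overhead is taken to be the number of visited blocks divided by the ideal number $\max\{\lceil |Q \cap X| / m \rceil, 1\}$, following the description "in terms of block accesses". All names are hypothetical.

```python
import math


def redundancy(x, blocks):
    """Number of blocks containing the data point x."""
    return sum(1 for block in blocks if x in block)


def average_redundancy(dataset, blocks):
    """Average of the redundancy over all data points."""
    return sum(redundancy(x, blocks) for x in dataset) / len(dataset)


def block_access_overhead(query, visited_blocks, dataset, block_size):
    """Visited blocks divided by the ideal number of blocks (assumed numerator)."""
    ideal = max(math.ceil(len(query & dataset) / block_size), 1)
    return len(visited_blocks) / ideal


dataset = {1, 2, 3, 4, 5, 6}
blocks = [{1, 2, 3}, {3, 4, 5}, {5, 6, 1}]       # every block holds m = 3 points
query = frozenset({2, 3, 4})
visited = blocks[:2]                             # blocks an index had to open
```

In this toy configuration three of the six points are stored twice, and the query, whose answer would fit in a single block, requires two block accesses.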

5.7.2 Workloads and pq-spaces 

In order to estimate the average performance it is necessary to have a probability
distribution on the set of queries, which is often not available in any useful form.
This is true in particular for similarity workloads with range queries, which depend
both on the query centre $\omega \in \Omega$ and the radius $\varepsilon$. Subsequently,
we shall assume that the radius is fixed and attempt to analyse the performance of
indexing schemes with only the query centre $\omega$ as a parameter.

Indeed, there are good reasons to consider performances of indexing schemes
for different search radii separately. We show in Subsection 5.7.3 that there are
significant qualitative differences between performances at different scales.
Furthermore, this approach corresponds with many real-life situations where the
radius has a direct, problem-specific interpretation and is chosen in advance. One
example is biological sequence search performed by BLAST [6] - in almost all
practical cases the users do not change the default threshold which corresponds to
the expected number of sequences to be retrieved according to a null model. The
threshold is translated into a cutoff similarity score and thus into a quasi-metric
radius (depending on the query centre only).

Therefore, we shall assume that the domain $\Omega$ is equipped with a (Borel)
probability measure $\mu$ reflecting the distribution of query centres. If the
dissimilarity measure $d$ is a metric (respectively quasi-metric), it follows that the
triple $(\Omega, d, \mu)$ is a pm- (respectively pq-) space. The measure $\mu$ can
always be approximated from the dataset itself: for any $A \subseteq \Omega$ set

\[ \mu(A) = \frac{|A \cap X|}{|X|}. \]

This would imply that the distribution of the query centres coincides with the
distribution of the dataset and is the approach taken in [39].

A complementary way of looking at the measure $\mu$ on $\Omega$ is to treat it as a sort
of an 'ideal' measure and the dataset as an $n$-point sample according to $\mu$. One
can consider a family of datasets from $\Omega$ distributed according to $\mu$ and
attempt to construct an indexing scheme which would answer queries of all datasets
efficiently. This was one of the reasons we defined the queries as subsets of $\Omega$
rather than $X$.
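The approximation of the measure from the dataset, in which a subset is given the proportion of data points it contains, can be sketched as follows. The sketch assumes a finite dataset without repeated points and represents subsets of the domain by any Python container supporting `in`; the name `empirical_measure` is hypothetical.

```python
from fractions import Fraction


def empirical_measure(dataset):
    """Return mu with mu(A) = |A ∩ X| / |X| for the finite dataset X."""
    points = list(dataset)

    def mu(subset):
        inside = sum(1 for x in points if x in subset)
        return Fraction(inside, len(points))

    return mu


mu = empirical_measure([1, 2, 3, 4])
half = mu({1, 2, 9})          # the point 9 lies outside the dataset and carries no mass
```

Exact fractions are used so that the measure of the whole dataset is exactly 1, which keeps later comparisons with thresholds such as 1/2 free of rounding issues.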

One can go even further by having two measures on $\Omega$ - one giving the dataset
distribution as above and another, possibly very different, providing the
distribution of the query centres. It has long been observed in the context of
relational databases [37] that it is necessary to consider non-uniform distributions
of queries in order to estimate the query performance well, and there is no reason to
suppose that the same does not hold for similarity-based queries. However, the
introduction of a second measure would present non-trivial technical challenges
and we therefore leave it for subsequent work.

5.7.3 The Curse of Dimensionality 

It has long been known (cf. for example lfT6ll ) that exponential complexity might
be inherent in any algorithm for answering near neighbour queries because a point
in a high-dimensional space can have many 'close' neighbours. In fact, this
phenomenon is not only associated with similarity searches but with other data
analysis related areas such as machine learning using neural networks [22], clustering
[92], function or density estimation iMl, signal processing [202] and many others.
In all cases the procedures that perform well on two or three dimensional
sets fail to do so in higher dimensions. We take the paradigm of Pestov [154] that
the curse of dimensionality is primarily a manifestation of the concentration
phenomenon. It allows us to use the techniques developed in Chapter 4 to provide
estimates of performance of indexing schemes with as few assumptions as possible
regarding the nature of the dataset. We first outline the previous results for the
nearest neighbour queries and then proceed to our contribution for range queries
in quasi-metric workloads.

Nearest Neighbour Queries

In their 1999 paper, Beyer et al. [20] investigated the effect of dimensionality on
the nearest neighbour problem. Their main result states that under certain
conditions every nearest neighbour query (in a metric space) is unstable: the distance
from any point to its nearest neighbour is very close to the distances to most other
points. We outline here the contribution of Pestov [154], who both relaxed the
assumptions of Beyer et al. and obtained stronger conclusions using the techniques
of asymptotic geometric analysis, that is, the concentration phenomenon.

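Definition 5.7.6 ([20]). Let $(\Omega, X, \mathcal{Q}_{\mathrm{NN}})$ be a workload where
$(\Omega, d)$ is a metric space and $\mathcal{Q}_{\mathrm{NN}}$ is the set of nearest
neighbour queries. A query $Q(\omega, X) \in \mathcal{Q}_{\mathrm{NN}}$ is called
$\varepsilon$-unstable for an $\varepsilon > 0$ if

\[ |\{x \in X : d(\omega, x) \le (1 + \varepsilon) d_X(\omega)\}| > \frac{|X|}{2}. \]

▲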
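The quantity compared against a threshold in Definition 5.7.6 is easy to compute directly. The sketch below returns the fraction of data points lying within $(1 + \varepsilon)$ times the nearest neighbour distance of a query point; deciding instability then amounts to comparing this fraction with the threshold of the definition. The function name `near_fraction` and the one-dimensional toy data are hypothetical.

```python
def near_fraction(dataset, omega, eps, dist):
    """Fraction of data points x with d(omega, x) <= (1 + eps) * d_X(omega)."""
    nearest = min(dist(omega, x) for x in dataset)      # d_X(omega)
    within = sum(1 for x in dataset
                 if dist(omega, x) <= (1 + eps) * nearest)
    return within / len(dataset)


def absolute_difference(a, b):
    return abs(a - b)


data = [10.0, 10.5, 11.0, 30.0]
wide = near_fraction(data, 0.0, 0.2, absolute_difference)     # ball of radius 12
narrow = near_fraction(data, 0.0, 0.01, absolute_difference)  # ball of radius 10.1
```

In high-dimensional concentrated spaces this fraction is typically close to 1 even for small $\varepsilon$, which is the instability phenomenon discussed in this subsection.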

Definition 5.7.7. Let $(\Omega, d, \mu)$ be a pm-space and $X \subseteq \Omega$ a finite
subset. For an $x \in X$ denote by
$R_x = \sup\{r > 0 : \mu(\mathfrak{B}_r(x)) \le \tfrac{1}{2}\}$ the maximal radius of an
open ball in $\Omega$ centred at $x$ of measure not more than $\tfrac{1}{2}$. For a
$\delta > 0$ we say that $X$ is weakly $\delta$-homogeneous in $\Omega$ if all radii
$R_x$, $x \in X$, belong to an interval of length less than $\delta$. ▲

Theorem 5.7.8 ([154]). Let $(\Omega, d, \mu)$ be a pm-space and $X \subseteq \Omega$ a
finite subset. Denote by $M$ a median value of $d_X$, the distance from a point in
$\Omega$ to its nearest neighbour in $X$. Let $0 < \varepsilon < 1$ and assume that $X$
is weakly $(M\varepsilon/6)$-homogeneous in $\Omega$.

Then for all points $\omega \in \Omega$, apart from a set of total measure at most
$3\alpha(M\varepsilon/6)$, the open ball of radius $(1 + \varepsilon) d_X(\omega)$
centred at $\omega$ contains at least

\[ \min\left\{ |X|, \left\lceil \frac{1}{2\sqrt{\alpha(M\varepsilon/6)}} \right\rceil \right\} \]

elements of $X$.

Hence, provided that $X$ is weakly $(M\varepsilon/6)$-homogeneous in $\Omega$ (which it
is, as remarked in [154], with probability not less than
$1 - 2|X|\alpha(M\varepsilon/12)$ if $X$ is sampled randomly with regard to $\mu$) and
that $(\Omega, d, \mu)$ has the concentration property, with very high probability
every nearest neighbour query is $\varepsilon$-unstable.

The point of all this is that in the case of query instability there is little
information to be gained by the nearest neighbour search - the quality of results is
such that they cannot be well interpreted. Hinneburg et al. [91] proposed a solution
to a generalised nearest neighbour problem by dimensionality reduction and weighting
of the dimensions according to the query point. This amounts to a redefinition
of the metric to be used. In all cases, it is not hard to see that the performance of
any indexing scheme is poor if almost the whole dataset is to be retrieved.

Range Queries 

Turning to range queries in quasi-metric spaces, we adopt the paradigm outlined
in Subsection 5.7.2. The radius $\varepsilon$ is fixed while the query centres are
distributed according to a measure $\mu$ on $\Omega$. We are interested in the number of
blocks that need to be processed in order to answer the query
$\mathfrak{B}^L_\varepsilon(\omega)$, which would give us an estimate on the
$\mathrm{time}_{\mathcal{B}}$ and the access overhead. Since metric and quasi-metric
trees are built hierarchically so that at each level and at each node we have a set
covering a portion of the dataset, the same result can be used to give an estimate
for the $\mathrm{time}_{\mathcal{F}}$.

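Lemma 5.7.9. Let $(X, d)$ be a quasi-metric space, $A \subseteq X$ and
$0 < \delta < \varepsilon$. Then
$(A^R_\delta)^R_{\delta'} \subseteq A^R_\varepsilon$, where
$\delta' = \varepsilon - \delta$.

Proof. Suppose $x \in (A^R_\delta)^R_{\delta'}$. Then there exists $y \in A^R_\delta$
such that $d(x, y) < \delta'$. By Lemma 2.7.6,
$d(x, A) \le d(x, y) + d(y, A) < \delta' + \delta = \varepsilon$. □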

Lemma 5.7.10. Let $(X, d, \mu)$ be a pq-space, $A$ a Borel subset of $X$,
$\varepsilon > 0$ and $\mu(A) > \alpha^L(\varepsilon)$. Then
$\mu(A^R_\varepsilon) \ge \tfrac{1}{2}$.

Proof. Suppose that $\mu(A) > \alpha^L(\varepsilon)$ and
$\mu(A^R_\varepsilon) < \tfrac{1}{2}$. Let $B = X \setminus A^R_\varepsilon$. Then
$\mu(B) > \tfrac{1}{2}$ and therefore
$\mu(A) \le \mu(X \setminus B^L_\varepsilon) = 1 - \mu(B^L_\varepsilon) \le \alpha^L(\varepsilon)$,
leading to a contradiction. □

The following is proved using a similar technique to Lemma 4.2 of [154].
In addition to the worst case result similar to the one provided in [154], we also
give a bound for the average case performance, which is arguably more important
than the worst case.

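Theorem 5.7.11. Let $(\Omega, d, \mu)$ be a pq-space and $\mathcal{B}$ a collection of
subsets $B \subseteq \Omega$ such that $\mu(\bigcup \mathcal{B}) = 1$ and, for all
$B \in \mathcal{B}$, $\mu(B) \le \gamma \le \tfrac{1}{2}$. Denote by
$\delta = (\alpha^L)^{\leftarrow}(\gamma) = \inf\{\varepsilon > 0 : \alpha^L(\varepsilon) \le \gamma\}$
the generalised inverse of $\alpha^L$ at $\gamma$.

Then, for any $\varepsilon > \delta$,

1. There exists $\omega \in \Omega$ such that $\mathfrak{B}^L_\varepsilon(\omega)$ meets
at least

\[ \min\left\{ \left\lceil \frac{1}{2\alpha^L(\delta)} \right\rceil, \left\lceil \frac{1}{\alpha^R(\varepsilon - \delta)} \right\rceil - 1 \right\} \]

elements of $\mathcal{B}$.

2. A left ball $\mathfrak{B}^L_\varepsilon(\omega)$ around $\omega \in \Omega$ meets on
average (in $\mu$) at least

\[ \min\left\{ \left\lceil \frac{1}{4\alpha^L(\delta)} \right\rceil, \left\lceil \frac{1}{4\alpha^R(\varepsilon - \delta)} \right\rceil \right\} \]

elements of $\mathcal{B}$.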

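Proof. By the assumption on each $B \in \mathcal{B}$ and by the choice of $\delta$,
$\mu(B) \le \gamma \le \alpha^L(\delta)$. Decompose $\mathcal{B}$ into a collection of
pairwise disjoint subfamilies $\mathcal{B}_i$, $i \in I$, in such a way that
$\alpha^L(\delta) < \mu(A_i) \le 2\alpha^L(\delta)$ for each
$A_i = \bigcup \mathcal{B}_i$. Clearly,

\[ \frac{1}{2\alpha^L(\delta)} \le |I| \le \frac{1}{\alpha^L(\delta)}. \]

Let $\delta' = \varepsilon - \delta > 0$. Then, by Lemmas 5.7.9 and 5.7.10,

\[ \mu\big((A_i)^R_\varepsilon\big) \ge \mu\big(((A_i)^R_\delta)^R_{\delta'}\big) \ge 1 - \alpha^R(\varepsilon - \delta), \]

and hence the probability that a random left ball of radius $\varepsilon$ does not
intersect $A_i$ is at most $\alpha^R(\varepsilon - \delta)$. For any $J \subseteq I$,

\[ \mu\Big(\bigcap_{i \in J} (A_i)^R_\varepsilon\Big) \ge 1 - |J| \, \alpha^R(\varepsilon - \delta). \]

The first claim follows by choosing $J$ such that
$|J| = \min\{|I|, \lceil 1/\alpha^R(\varepsilon - \delta) \rceil - 1\}$, so that
$\mu\big(\bigcap_{i \in J} (A_i)^R_\varepsilon\big) > 0$. To prove the second
statement observe that the probability that a random ball of radius $\varepsilon$
meets at least $\lceil 1/(2\alpha^R(\varepsilon - \delta)) \rceil$ elements (or all
$|I|$ subfamilies, if there are fewer) is at least $\tfrac{1}{2}$. Hence, the average
number of subsets of $\mathcal{B}$ intersecting a ball of radius $\varepsilon$ is at
least $\lceil 1/(4\alpha^R(\varepsilon - \delta)) \rceil$ or
$\lceil 1/(4\alpha^L(\delta)) \rceil$, whichever is smaller. □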



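Our result directly leads to the following Corollary stated in terms of a range
similarity workload (with fixed radius). Note that the open balls are replaced by
the closed balls in order to be consistent with the definition of the range similarity
workload.

Corollary 5.7.12. Let $\gamma \le \tfrac{1}{2}$, $\delta = (\alpha^L)^{\leftarrow}(\gamma)$
and $\varepsilon > \delta$, and let $W = (\Omega, X, \mathcal{Q})$ be a workload where
$\mathcal{Q} = \{\overline{\mathfrak{B}}^L_\varepsilon(\omega) : \omega \in \Omega\}$ (the
left closed balls are taken with respect to a quasi-metric $d$ on $\Omega$). Suppose the
dataset $X$ and the query centres are distributed according to the Borel probability
measure $\mu$ on $\Omega$. Let $\mathcal{B}$ be a finite set of blocks such that
$\mu(\bigcup \mathcal{B}) = 1$ and for any $B \in \mathcal{B}$, $\mu(B) \le \gamma$. Then
the number of blocks accessed to retrieve the query
$\overline{\mathfrak{B}}^L_\varepsilon(\omega)$ is on average at least

\[ \left\lceil \frac{1}{4\alpha^L(\delta)} \right\rceil \quad \text{or} \quad \left\lceil \frac{1}{4\alpha^R(\varepsilon - (\alpha^L)^{\leftarrow}(\gamma))} \right\rceil, \]

whichever is smaller, and in the worst case at least

\[ \left\lceil \frac{1}{2\alpha^L(\delta)} \right\rceil \quad \text{or} \quad \left\lceil \frac{1}{\alpha^R(\varepsilon - (\alpha^L)^{\leftarrow}(\gamma))} \right\rceil - 1, \]

whichever is smaller. □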



As observed in Chapter 4, for many metric spaces we have
$\alpha(\varepsilon) \le C_1 e^{-C_2 \varepsilon^2 N}$, where $N$ is the dimension of the
space. In this case it is easy to see that any indexing scheme, unless its blocks all
have very small measure, will need to scan very many blocks in order to retrieve not
only the worst case but also a typical range query. Even if the access overhead is
not large, the sequential scan of the whole dataset might outperform an indexing
scheme due to the overhead associated with the tree structure. The bounds from
Theorem 5.7.11, while certainly not tight, give some indication of the number of
blocks that can be expected to be retrieved.
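The growth of the bound with dimension can be illustrated numerically. The sketch below is only a rough illustration under stated assumptions: the concentration function is replaced by a Gaussian-type upper bound $C_1 e^{-C_2 \varepsilon^2 N}$ with hypothetical constants $C_1 = C_2 = 1$, and only the term $\lceil 1/(4\alpha(\varepsilon - \delta)) \rceil$ of the average-case estimate is evaluated, ignoring the other term of the minimum.

```python
import math


def gaussian_concentration_bound(eps, dim, c1=1.0, c2=1.0):
    """Assumed upper bound C1 * exp(-C2 * eps^2 * N) for the concentration function."""
    return c1 * math.exp(-c2 * eps * eps * dim)


def average_blocks_term(eps, delta, dim, c1=1.0, c2=1.0):
    """The term ceil(1 / (4 * alpha(eps - delta))), using the assumed bound for alpha."""
    alpha = gaussian_concentration_bound(eps - delta, dim, c1, c2)
    return math.ceil(1.0 / (4.0 * alpha))


# Radius exceeding the scale delta by 0.5, in dimensions 1, 10 and 100.
terms = [average_blocks_term(eps=1.0, delta=0.5, dim=n) for n in (1, 10, 100)]
```

Since an upper bound on $\alpha$ gives a lower bound on $1/(4\alpha)$, the computed values remain valid lower estimates of that term; they grow exponentially with $N$, so in high dimensions the other term of the minimum, and ultimately the total number of blocks, becomes the binding constraint.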

Note that Theorem 5.7.11 holds only for $\varepsilon > \delta$ - the value $\delta$ is
the scale at which we observe such a phenomenon. Obviously, at scales smaller than
$\delta$ the indexing scheme need not suffer in performance. Observe that both
$\alpha^L$ and $\alpha^R$ are involved but their roles are not the same. The left
concentration function determines the scale at which the concentration effect takes
place while $\alpha^R$ establishes the number of bins accessed. For 'bad' performance
it is necessary that $\alpha^R$ decreases sharply near 0.

Since our metric and quasi-metric indexing schemes, as defined in Sections
5.3 and 5.4, involve covering sets at each level of the tree, it is straightforward
to apply Theorem 5.7.11 to derive the bounds for the number of certification
function evaluations at each level.



5.7.4 Dimensionality estimation 

Unlike our approach above, which uses only geometric assumptions and where
the performance is linked to the concentration functions, Pagel, Korn and Faloutsos
[149] seek to estimate the performance of nearest neighbour query retrieval
based on fractal (Hausdorff or correlation) dimensions of the dataset. This line
of investigation stems from the observation that for real datasets embedded in
vector spaces, features are often correlated and hence the estimates based on
independence assumptions are too pessimistic. Hence the effort to find the 'real'
dimensionality of the datasets.

Traina, Traina and Faloutsos [188] introduced the distance exponent which
gives the intrinsic dimension of any metric space by assuming that (at least for
small $\varepsilon$) the size of a ball grows proportionally to $\varepsilon^N$, where
$N$ is the dimension of the space. They claimed that performance of metric trees could
be well approximated in terms of the distance exponent. As a part of his summer
research assistantship at the Australian National University in summer 1999/2000,
the thesis author performed some experiments to determine the ways of estimating
the distance exponent from the datasets. These previously unpublished results are
presented in Appendix A.

In [36] another definition of the intrinsic dimensionality is given (again in
terms of the distance distribution) and bounds on the number of distances to be
evaluated by metric indexing schemes are derived.

5.8 Discussion and Open problems

So far we have provided a conceptual framework for similarity search and hinted
that the Curse of Dimensionality is related to the concentration phenomenon.
Theorem 5.7.11 extends the previous results to the case of range searches in
quasi-metric spaces. We next outline possible directions for further investigation.

5.8.1 Workload reductions 

Our definition of an indexing scheme (Definition 5.2.15) emphasises the three
structures which are found in all examples known to us: the set of blocks that
cover the dataset, the tree structure supporting an access method and the decision
functions. While this setting allows us to directly identify the factors that
influence the performance, access methods for similarity queries could be investigated
through workload reductions as in Section 5.6, without explicit reference to
indexing schemes.

Consider a tree workload, $W_T = (T, T, \mathcal{Q})$, where $T$ is a finite rooted
directed weighted tree, such that every edge is assigned a zero weight in the direction
towards the root and a positive weight in the opposite direction. Here $\mathcal{Q}$ is the set
of range similarity queries induced by the path quasi-metric (Section [2771) . There 
is an obvious access method associated with such workload: traverse the tree 
starting from the query point and retrieve all nodes closer than the cutoff value. 

Observe that any metric or quasi-metric indexing scheme where the blocks
are pairwise disjoint can be represented as a projective reduction of the original
workload $W_0$ to a discrete workload mapping each point to its block, followed by
an inductive reduction to a tree workload. In our notation,

\[ W_0 \overset{r}{\Rightarrow} (\mathcal{B}, \mathcal{B}, 2^{\mathcal{B}}) \overset{i}{\Leftarrow} W_T. \]

The requirement that the blocks are pairwise disjoint comes from $r$ being a
function - this is a limitation that may need to be overcome.

While this approach is perhaps too abstract and limited at this stage, hiding the
decision functions in the reduction maps, it opens new lines of investigation. In
particular, one can ask if all access methods involve reductions to inner workloads
and attempt to construct access methods involving inductive reductions to non-tree
workloads.

Another topic for investigation would be to construct a hierarchy of all workloads
(with measures on the sets of queries) according to their indexability, a term
introduced in [87]. For example, a workload would be higher in the hierarchy if it
is more difficult to index and one could decide indexability of any particular
workload in reference to some canonical workloads. It is clear that the trivial
workload should be on the top of the hierarchy as the most difficult to index.

For mm-spaces, one can hope to be able to use Gromov's relation $\succ$ between
mm-spaces ([79], Chapter 3½, pp. 133-140): for two mm-spaces $X$ and $Y$, $X$
(Lipschitz) dominates $Y$, denoted $X \succ Y$, if there exists a 1-Lipschitz map
$X \to Y$ pushing forward the measure $\mu_X$ to a measure $\nu$ on $Y$ proportional
to $\mu_Y$. Obviously, a one point space $\{*\}$ (with any measure) is a minimal
mm-space and the more concentrated a space is, the more it is dominated by other
mm-spaces. It should be possible to generalise this notion to quasi-metric spaces
with measure. Going even further, one would wish to include the dataset in any
resulting theory.

5.8.2 Certification functions 

As we noted before, the bounds from Corollary 5.7.12 are not tight - they
usually indicate better than actual performance. Indeed, much closer estimates
can be obtained if the distributions of the values of the certification functions
are known, such as in [39] where they correspond to the distance distributions.
Ciaccia and Patella also emphasise that their model attests that the performance 
depends only on the distributions of the index and comparison distances (i.e. the 
certification functions) and not on the query distance. This is not contrary to our 
results - our bounds are for a best possible indexing scheme and the performance 
in practice could be much worse. 

Hence, there are reasons to believe that the main cause of the Curse of
Dimensionality is not the inherent high dimensionality of datasets, but a poor choice
of certification functions. Efficient indexing schemes require the use of dissipating
functions, that is, 1-Lipschitz functions whose values are more widely spread
and which are still computationally cheap. Such functions correspond to 'tighter'
covering sets with little overlap between them. This interplay between complexity
and dissipation is, we believe, at the very heart of the nature of the dimensionality
curse, at least in relation to time. Requirements for blocks to contain a certain
number of points contribute substantially as well.

Generic metric indexing schemes use only distances (from points) to construct
their certification functions. While this ensures that they can be applied to any
metric space, it may also be a significant limitation if the distances are
computationally expensive. More specific knowledge of the geometry of the domain is
clearly necessary to produce computationally cheaper certification functions. The
QIC-M-tree [39] is a great step in this direction as it allows the user to specify
three distances to be used. It should be possible to go even further by developing
a structure which allows the user to specify classes of certification functions and
an algorithm which fits them to a dataset and produces an indexing scheme. The
insight gained by the approaches attempting to reduce overlap between the covering
sets associated with the nodes of a metric tree, such as Slim-trees [189], will
no doubt play a role.

5.9 Conclusion 

Our proposed approach to indexing schemes used in similarity search allows for a
unifying look at them and facilitates the task of transferring the existing expertise
to more general similarity measures than metrics. In particular, we have extended
the concepts associated with metric workloads to quasi-metric workloads.

We hope that our concepts and constructions will meld with methods of geometry
of high dimensions and lead to further insights on the performance of indexing
schemes. While we have not yet reached the stage where asymptotic geometric
analysis can give accurate predictions of performance, as there exists no algorithm
for estimating concentration functions from a dataset, at least it leads to some
conceptual understanding of their behaviour. We have deliberately ignored
non-consistent indexing schemes in our discourse - while they may show much better
performance, they do so at the price of losing some members of the query.

In the next chapter we shall further illustrate our concepts on the concrete
dataset of peptide fragments and point out some specific issues affecting the
performance of indexing schemes.



Chapter 6 

Indexing Protein Fragment Datasets 



While the previous chapters emphasised the theory, laying the foundations and
introducing the concepts, the present chapter and the one following focus on
applications to actual protein sequence datasets. The present chapter has two
principal aims: to illustrate the notions of Chapter 5 on the sets of biological
sequences and to introduce an indexing scheme for datasets of short peptide
fragments to be used for the biological investigations of Chapter 7.

An additional reason for studying indexing schemes for short peptide fragments
is that it has been frequently pointed out in the literature
[132, 143, 99, 100, 103, 29, 144, 10] that algorithms for indexing short fragments
could be used as subroutines of BLAST-like programs for searches of full sequences.
It is hoped that, as a part of the future work, the experience gained from indexing
short fragments could be applied to the challenge of indexing datasets of full DNA
and protein sequences.

6.1 Protein Sequence Workloads 

Let $\Sigma$ denote the standard 20 amino acid alphabet. A full sequence workload has
the domain $\Sigma^*$ and the sets of queries consisting of range or kNN queries based
on the quasi-metric corresponding to the local (Smith-Waterman) similarity scores
based on BLOSUM matrices and affine gap penalties. The dataset in this case is
any actual set of protein sequences.

A short fragment workload has the domain $\Sigma^m$, the set of all amino acid
sequences of length $m$, where $m$ will mostly range from 6 to 12. The set of queries
consists of range or kNN queries based on an $\ell_1$-type quasi-metric extending a
quasi-metric $d_\Sigma$ on $\Sigma$ (Section 3.2). The co-weightable quasi-metric $d_\Sigma$ is derived
from a similarity score matrix $s$ from the BLOSUM family using the formula
$d_\Sigma(x, y) = s(x, x) - s(x, y)$, while the dataset is obtained from a full sequence
dataset by taking all fragments of length $m$ from all sequences.
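As a small illustration of the formula above, the following sketch derives a letter quasi-metric from a score matrix and extends it to fragments by summing over positions. The three-letter alphabet and the score values are assumptions made up for the example; they are not taken from any BLOSUM matrix.

```python
# Toy symmetric score matrix on the alphabet {A, B, C} (illustrative values only).
SCORES = {
    ("A", "A"): 4, ("A", "B"): 1, ("A", "C"): -2,
    ("B", "B"): 5, ("B", "C"): 0,
    ("C", "C"): 9,
}


def s(x, y):
    """Symmetric lookup of the similarity score of two letters."""
    return SCORES.get((x, y), SCORES.get((y, x)))


def d_letter(x, y):
    """Quasi-metric on letters: d(x, y) = s(x, x) - s(x, y)."""
    return s(x, x) - s(x, y)


def d_fragment(u, v):
    """l1-type extension to fragments of equal length."""
    if len(u) != len(v):
        raise ValueError("fragments must have the same length")
    return sum(d_letter(a, b) for a, b in zip(u, v))


print(d_letter("A", "B"), d_letter("B", "A"))   # differ because s(A, A) != s(B, B)
print(d_fragment("AC", "BA"), d_fragment("BA", "AC"))
```

The two directions differ exactly because the self-similarities of the letters differ, which is the only source of asymmetry in this construction.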

Depending on the protein sequence dataset, there may exist cases where two 
short fragments have the same sequence (Subsection 6.1.2). For the purpose of
this thesis, a kNN query is defined with respect to the original fragment dataset 
(which is therefore a pseudo-quasi-metric space), not to the quotient set where 
points with identical sequence are merged into one point. 

Most of the present chapter, as well as Chapter 7, examines short fragment
workloads with some ideas transferable to full sequence workloads. The remain- 
der of the present section investigates some geometric aspects of sets of short 
peptide fragments. 

6.1.1 Sequence datasets 

Two protein sequence datasets were used for investigations of the present chapter: 
NCBI nr (non-redundant) [208] and SwissProt [23].

The NCBI nr dataset is a comprehensive general protein sequence database, including
entries from most other major protein sequence databases (such as SwissProt)
as well as the translated coding sequences from GenBank entries (GenPept). Where
multiple identical sequences exist, they are consolidated into one entry. The nr
dataset is the main dataset searched by NCBI BLAST and the latest version can be
downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ where other
datasets searched by NCBI BLAST can be found as well. Since the full nr dataset
is very large (the version from June 2004 contains 1,866,121 sequences consisting
of 619,474,291 amino acids), smaller samples rather than the full dataset were
used. It should be noted that many protein sequences belonging to GenPept and
hence nr were translated from coding segments of GenBank sequences that were
verified solely using computational techniques, that is, without experimental
validation. Thus, nr may contain sequences which are not expressed in any organism.

The SwissProt dataset, maintained at the Swiss Institute of Bioinformatics
(http://www.expasy.org/sprot/), is "a curated protein sequence database
which strives to provide a high level of annotation (such as the description of the 
function of a protein, its domains structure, post-translational modifications, vari- 
ants, etc.), a minimal level of redundancy and high level of integration with other 
databases". Its entries contain, apart from the sequence information, extensive 
functional annotation, literature citations and links to other resources. Because of 
its moderate size, non-redundancy and high level of sequence characterisations, 
SwissProt (Release 43.2 of April 2004, containing 144,731 sequences consisting 
of 53,363,726 amino acid residues) was used as the main dataset for the experi- 
ments of this chapter. 

6.1.2 Unique fragments 

SwissProt and nr are (almost - there are a few duplicate sequences in SwissProt)
non-redundant. However, when short fragments are taken to form the fragment
database, it often occurs that multiple instances of the same fragment exist
(Figure 6.1). In other words, the underlying measure on $\Sigma^m$, where $m$ is small,
is not the counting measure.

For similarity searches, this situation can be handled in two ways. If many 
duplicate fragments are present (very short fragment lengths), a preprocessing step 
is necessary to collect the identical fragments together, introducing some space 
overhead but significantly saving search time. If relatively few duplicates (longer 
fragment lengths) are present, they can be treated as separate points introducing 
an additional time cost for unnecessary distance evaluations but avoiding space 
overhead for collecting identical fragments. 

Figure 6.1: Percentages of unique fragments of fixed length from the SwissProt dataset
out of total fragments in the dataset and total possible fragments. The fragments
containing letters not belonging to the standard amino acid alphabet were ignored.

A further observation that can be made from Figure 6.1 is that for very
short fragments, almost every possible sequence is represented in the dataset -
the workload is effectively inner, allowing the possibility of using combinatorial
algorithms for indexing. This is definitely not true for longer fragments and full
sequences, where the workload is outer. For example, the number of potential
fragments of length 10 is $20^{10}$ while there are only about 38.5 million (or 0.0004%)
unique fragments in SwissProt.
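The quantities plotted in Figure 6.1 can be computed along the following lines. This is a minimal sketch: the function name and the two short input strings are hypothetical, and a real run would read the sequences from a SwissProt release.

```python
def fragment_statistics(sequences, m, alphabet):
    """Count total, unique and possible fragments of length m.

    Fragments containing letters outside `alphabet` are ignored, as in the
    caption of Figure 6.1.
    """
    allowed = set(alphabet)
    total = 0
    unique = set()
    for seq in sequences:
        for i in range(len(seq) - m + 1):
            frag = seq[i:i + m]
            if set(frag) <= allowed:
                total += 1
                unique.add(frag)
    return total, len(unique), len(allowed) ** m


# Toy input over a two-letter alphabet; X stands for a non-standard letter.
total, n_unique, possible = fragment_statistics(["ABAB", "BABX"], 2, "AB")
print(total, n_unique, possible)
print(100 * n_unique / total, 100 * n_unique / possible)  # the two percentages
```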

6.1.3 Random sequences 

Most experiments of this chapter, investigating geometry of datasets and perfor- 
mance of indexing schemes, involve simulating a probability measure on the set 
of all possible protein fragments using generated random sequences. It is neces- 
sary to do so because the workloads (with the exception of sets of fragments of 
very short lengths) are outer and it is quite likely that a query sequence would be 
(slightly) different from all sequences existing in a dataset. Generally, the 'true' 
distribution of protein sequences or fragments is unknown and the measure ob- 
tained by counting the points of an actual dataset is not appropriate because the 
full natural variation of protein sequences cannot be captured by any dataset, that
is, one always expects to discover novel sequences. Hence, it is necessary to use 
theoretical models of sequence distributions and attempt to balance the practical 
issues, such as the ability to quickly generate sufficiently many random sequences, 
with accuracy. 

The simplest way of generating random fragments of fixed length is to assume
the underlying measure is the product measure based on background (overall)
amino acid frequencies, that is, to generate each fragment by an independent,
identically distributed process where the probability measure is given by the
background frequencies. Such an approach can be extended to sequences of arbitrary
length by modelling sequence length according to some distribution (for example,
discretised log-normal [151]) and, once the length is chosen, proceeding as above.
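A minimal sketch of the product-measure model for fragments of fixed length is given below. The uniform `BACKGROUND` table is only a placeholder assumption; in practice it would be replaced by measured background frequencies, which are not reproduced here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder background distribution (uniform); an assumption for illustration.
BACKGROUND = {a: 1 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def random_fragments(count, m, frequencies, seed=None):
    """Generate `count` fragments of length m with i.i.d. letters."""
    rng = random.Random(seed)
    letters = list(frequencies)
    weights = [frequencies[a] for a in letters]
    return ["".join(rng.choices(letters, weights=weights, k=m)) for _ in range(count)]


sample = random_fragments(5, 6, BACKGROUND, seed=0)
print(sample)
```

Fixing the seed makes the test set reproducible, which matters when the same random queries are to be reused across several indexing schemes.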

A more general model, actually used to generate testing datasets for the
experiments of the current chapter, is based on Dirichlet mixtures [174]. As in the
previous case, the length of each sequence is taken from a discretised log-normal
distribution and the amino acids of a sequence are generated by an independent,
identically distributed process. However, the probabilities for that distribution are
selected from a mixture of Dirichlet densities (for a description of Dirichlet
distributions and mixtures see Chapter 11 of the book by Durbin et al. [52]) instead
of from a single (background) distribution.

The code and the data for generating random sequences according to Dirichlet
mixtures were obtained from http://www.cse.ucsc.edu/research/compbio/dirichlet
To obtain samples of fragments of fixed length to be used in experiments, for 
each desired length, 5000 non-overlapping fragments were sampled from full se- 
quences generated according to the above method. The same testing datasets were 
used for all experiments ensuring that performances of different indexing schemes 
can be directly compared. 

6.1.4 Quasi-metric or metric? 

Chapter 3 has shown that most common distances on protein sequences are
quasi-metrics. However, since the theory and practice of indexability of metric spaces
is much better studied, it is worthwhile to investigate the overhead of replacing a
quasi-metric by a metric.
Figure 6.2: Mean ratio between the sizes of smallest metric and quasi-metric balls
containing k nearest neighbours with respect to the BLOSUM62 quasi-metric. Each point is
based on 5,000 searches of SwissProt fragment datasets using randomly generated
fragments as ball centres.

From the point of view of performance, the best measure of the average over- 
head is the ratio between the sizes of the metric and the quasi-metric ball con- 
taining at least k nearest neighbours with respect to the quasi-metric. If this ratio 
is close to 1, the metric and the quasi-metric have similar geometry and the re- 
placement of the quasi-metric by a metric is feasible. The average sampled ratios 
for the fragment datasets of lengths 6, 9 and 12, using the associated metric (the 
smallest metric majorising the quasi-metric), are shown in Figure 6.2.

It is clear that replacement of quasi-metric by a metric would be very costly 
except for the nearest neighbour searches of very short fragments (length 6) and 
that it is indeed necessary to develop the theory and algorithms that would allow 
the use of the intrinsic quasi-metric. This observation was one of the principal 
motivations behind the development of the theory of quasi-metric trees in Chapter 5.

6.1.5 Neighbourhood of dataset

A further way of assessing how a dataset is embedded into its domain is by
considering how far the closest point of the dataset is from any point in the
domain, or alternatively, the smallest $\varepsilon$ such that the dataset forms an
$\varepsilon$-net inside the domain. Even more information is revealed by the distribution
of distances of points in the domain to the dataset; for example, it can be
determined if there is a sizable number of points significantly farther from the
dataset than the rest. Note that such a distribution function clearly depends on the
underlying measure on the domain (query distribution).

While an overwhelming amount of computation would be necessary to obtain
the exact distribution, it is possible to approximate it by resorting to simulation,
that is, by generating points according to the assumed measure and finding for
each generated point the distance to its nearest neighbour in the dataset. If an
efficient indexing scheme is available, such an approach is computationally inexpensive.
Figure 6.3 shows the results for SwissProt fragment datasets of lengths 6, 9 and
12 using the sample points generated according to Dirichlet mixtures (Subsection
6.1.3).
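The simulation described above can be sketched as follows. The sketch makes two simplifying assumptions that differ from the experiments in the text: the nearest neighbour is found by a linear scan rather than through an indexing scheme, and the Hamming distance is used as a stand-in for the BLOSUM62 quasi-metric.

```python
from collections import Counter


def hamming(x, y):
    """Stand-in distance; the experiments in the text use the BLOSUM62 quasi-metric."""
    return sum(a != b for a, b in zip(x, y))


def nearest_neighbour_distance(query, dataset, dist):
    # Distances are measured from the query to the dataset point, which matters
    # once `dist` is a genuinely asymmetric quasi-metric.
    return min(dist(query, x) for x in dataset)


def distance_distribution(queries, dataset, dist):
    """Empirical distribution of the distance from random points to the dataset."""
    counts = Counter(nearest_neighbour_distance(q, dataset, dist) for q in queries)
    return {d: c / len(queries) for d, c in sorted(counts.items())}


dataset = ["AAA", "ABB"]
queries = ["AAA", "AAB", "BBB"]   # would be randomly generated in practice
distribution = distance_distribution(queries, dataset, hamming)
print(distribution)
```

The mass at distance 0 is the fraction of random points that already belong to the dataset, which is the quantity used in the text to judge how close a workload is to being inner.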

The estimated distribution for the fragments of length 6 supports the observation
from Subsection 6.1.2 that the workloads based on sets of fragments of very
short length are close to inner: almost 60% of random points are in the dataset (the
BLOSUM62 quasi-metric (Figure 6.10), and hence its derived $\ell_1$-type distance on
fragments, is $T_1$ and therefore a distance of 0 implies identical fragments) and
most of the remainder are within one amino acid substitution from a dataset point
(Figure 6.10 shows the full BLOSUM62 quasi-metric). In fact, the proportion of
random points belonging to the dataset is much greater than the proportion of the
dataset in the domain from Figure 6.1 (about 30%), which is essentially based
on the counting measure on the domain. This (not surprisingly) indicates that the
measure based on Dirichlet mixtures indeed approximates the dataset better than
the counting measure. The distributions for the lengths 9 and 12 indicate that a
neighbour is very likely to be found in the biologically significant ranges (20-35).
Figure 6.3: Distributions of BLOSUM62 distances from random fragments to the
SwissProt fragment datasets. Based on 5000 random fragments generated according to
Dirichlet mixtures.



6.1.6 Distance Exponent 

Distance exponent (Appendix A), measuring the rate of growth of balls in a metric
space, can be used to estimate the dimensionality and hence the complexity of
workloads. The theory presently applies only to metric spaces (although the
rationale is equally valid for quasi-metric spaces) and therefore the metric associated
to the BLOSUM62 quasi-metric was used. Since the estimate of the dimensionality
of the full domain, rather than just of the dataset, was desired, the average size
(in terms of points of the dataset) of a ball of given radius centred at a random
point was computed and used to estimate the distance exponent. This approach
is justified by Remark A.1.6 provided the measure induced by the dataset is
a good approximation to the measure used to generate the ball centres (i.e. the
measure on the domain). The sizes of the balls of small radii for datasets of length
6 and 9 are shown in Figure 6.4 (log-log scale).

Figure 6.4: Growth of balls centred at 5000 random fragments generated according to
Dirichlet mixtures. The balls are taken with respect to the metric associated to the
BLOSUM62 quasi-metric.

It is apparent that the log-log graphs are not linear and therefore the method
based on fitting a polynomial (Subsection A.3.2) was used for distance exponent
estimation. The estimated distance exponent is 7.6 for the fragments of length 6
and 10.6 for the fragments of length 9. Hence, in this context, the datasets are
approximately equivalent to the cubes $[0, 1]^n$, with $n$ close to 7.6 and 10.6
respectively, with the $\ell_\infty$ metric (Subsection A.2.1). An interesting problem is
to determine if 'good' embeddings into cubes A[0, 1]^n exist and if so, to index them
as vector spaces, say using an X-tree.

6.1.7 Self-similarities 

As mentioned previously, in Chapter 3 as well as in the current chapter, protein
sequence fragments with (some) BLOSUM similarity measures can be treated as
co-weighted quasi-metric spaces with the co-weight of each point given by its
self-similarity. Self-similarities are significant because they are the sole source
of asymmetry of the quasi-metric: we have $\Gamma(x, y) = |d(x, y) - d(y, x)| =
|s(x, x) - s(y, y)|$, where $\Gamma$ denotes the asymmetry function introduced in
Section 4.6. Therefore, the distribution of self-similarities determines the 'distance'
of the quasi-metric space from its associated metric space. Furthermore, if
self-similarities of dataset points take very few values, as is the case with short
fragment datasets, the co-weighted quasi-metric space can be divided into metric
fibres which can be indexed separately using an indexing scheme for metric
workloads (FMtree - Example 5.6.1). Figure 6.5 shows the estimates of distributions
of self-similarities of SwissProt fragment datasets of length 7 and 12 based on
approximately 1,000,000 samples.
Figure 6.5: Distributions of self-similarities of SwissProt fragment datasets: (a) Length
7; (b) Length 12.



It can be seen that both distributions are skewed to the right and that the
distribution for the length 12 is more spread out, that is, less concentrated. However,
if something is to be inferred about the measure concentration and hence
indexability from self-similarities, it is necessary to take into account the scale. The
median distance to the nearest neighbour for the length 12 workload is about 23
(Figure 6.3) while it clearly cannot be greater than 10 in the length 7 case (the data
for length 7 are not available in Figure 6.3 but can be inferred from the data
for lengths 6 and 9). Thus, if scaled in this way, the distribution for length 7
would indeed be less concentrated.
6.2 Tries, Suffix Trees and Suffix Arrays

Trie, suffix tree and suffix array data structures form the basis of many of the 
established string search methods and provide an inspiration for some features of 
the FSIndex access method described in Section 6.3.

Let $\Sigma$ be a finite alphabet and $X$ be a collection of $\Sigma$-strings (i.e. $X \subseteq \Sigma^*$). A
trie [60] is an ordered tree structure for storing strings having one node for every
common prefix of two strings. The strings are stored in extra leaf nodes (Figure
6.6). A PATRICIA tree (Practical Algorithm to Retrieve Information Coded in
Alphanumeric [140]) is a compact representation of a trie where all nodes with
one child are merged with their parent. Tries and PATRICIA trees can be easily
used for string searches, that is, to find if a string $p$ belongs to $X$. Such searches
take O(n) time where $n = |p|$.
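A membership search in a trie can be sketched with nested dictionaries, as below. The class and its end-of-string marker are illustrative choices of ours (the marker plays the role of the extra leaf node); a PATRICIA tree would additionally merge chains of single-child nodes, which this sketch does not do.

```python
class Trie:
    """Minimal trie; each node is a dict from letters to child nodes."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # end-of-string marker, standing in for the leaf node

    def __contains__(self, word):
        node = self.root
        for ch in word:          # one step per letter, hence O(n) for |word| = n
            node = node.get(ch)
            if node is None:
                return False
        return None in node


trie = Trie()
for w in ["ABBA", "ABAB", "BABA"]:
    trie.insert(w)
print("ABBA" in trie, "ABB" in trie, "BBBB" in trie)
```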

Now consider a single (long) string $t \in X$ where $m = |t|$. The suffix tree [206]
for $t$ is the PATRICIA tree of the suffixes of $t$ and can be constructed in O(m)
time [206, 136, 190]. Suffix trees, in their original form as well as generalised to
suffixes of more than one string, can be used to solve a great variety of problems
involving matching substrings of long strings (Gusfield, in his book [83], dedicates
five full chapters exclusively to suffix trees and their applications).







Figure 6.6: A trie (left) and a PATRICIA tree (right) for a set of six strings of length 4. 
One disadvantage of suffix trees is that they often occupy too much space -
up to $\Theta(m |\Sigma|)$ in many common cases [83]. The suffix array data structure, first
proposed by Manber and Myers [129], is a compact representation of the suffix
tree for $t$ consisting of the array pos of integers in the range 0 ... m - 1 specifying
the lexicographic ordering of suffixes of $t$ (i.e. pos[i] is the starting position of the
i-th suffix of $t$ in lexicographic order), and the array lcp, where lcp[i] contains the
length of the longest common prefix of the suffixes starting at positions pos[i - 1]
and pos[i] (the first element of lcp is 0). Efficient O(m) construction algorithms exist
and, using binary search on the array pos and the lcp values, it is possible to search
for an occurrence of a string $p$ in $t$ in O(n + log m) time, where $n = |p|$ [83].
Figure 6.7 shows an example of a suffix tree and a suffix array.
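For the word ABBBAABA used in Figure 6.7, the pos and lcp arrays can be reproduced with the naive sketch below. It sorts the suffixes directly and uses a plain binary search, so it attains neither the O(m) construction nor the O(n + log m) search bound quoted above; it is meant only to make the two arrays concrete.

```python
def build_suffix_array(t):
    """Return (pos, lcp) for the string t by direct sorting of suffixes."""
    pos = sorted(range(len(t)), key=lambda i: t[i:])
    lcp = [0] * len(t)
    for i in range(1, len(t)):
        a, b = t[pos[i - 1]:], t[pos[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k  # length of the common prefix of neighbouring suffixes
    return pos, lcp


def occurs(t, pos, p):
    """Binary search for p among the sorted suffixes (lcp values not used here)."""
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[pos[mid]:pos[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(pos) and t[pos[lo]:pos[lo] + len(p)] == p


text = "ABBBAABA"
pos, lcp = build_suffix_array(text)
print(pos)
print(lcp)
print(occurs(text, pos, "BAA"), occurs(text, pos, "BBBB"))
```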

PATRICIA trees (and hence suffix trees and arrays), being compact represen- 
tations of a set of strings, can be used to speed-up string comparisons and searches 
fyil . Indeed it is very easy to construct a quasi-metric tree for the short fragment 
similarity workload $(\Sigma^m, X, Q)$ (Section 6.1) with a quasi-metric $d_\Sigma$. The tree is
given by a trie or a PATRICIA tree for $X$ and each block is a set containing a
single fragment associated with a leaf node. At each non-root node, a certification
function calculates the distance between a prefix given by the path from the root
to the node in question and a prefix of the query fragment of the same length, say
$k$. In effect, a certification function calculates the distance from the query to the
'cylindrical set' of fragments where the letters at the first $k$ positions are fixed while
varying arbitrarily at the remaining $m - k$ positions.

Figure 6.7: A suffix tree and a suffix array for the word ABBBAABA.
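The trie-based quasi-metric tree described above can be sketched as a pruned depth-first search, where the running prefix distance is the certification function. The three-letter score table is a made-up assumption (not a BLOSUM matrix), and all fragments are assumed to have the same length as the query.

```python
# Toy symmetric scores on {A, B, C}; illustrative values only.
SCORES = {
    ("A", "A"): 4, ("A", "B"): 1, ("A", "C"): -2,
    ("B", "B"): 5, ("B", "C"): 0,
    ("C", "C"): 9,
}


def d_letter(x, y):
    """d(x, y) = s(x, x) - s(x, y) for the toy scores above."""
    return SCORES[(x, x)] - SCORES.get((x, y), SCORES.get((y, x)))


def trie_range_search(fragments, query, eps, dist):
    """All fragments within distance eps of the query, found by trie traversal."""
    trie = {}
    for f in fragments:
        node = trie
        for ch in f:
            node = node.setdefault(ch, {})

    results = []

    def visit(node, depth, prefix, d_prefix):
        if depth == len(query):
            results.append((prefix, d_prefix))
            return
        for ch, child in node.items():
            # Prefix distance = distance from the query to the cylindrical set.
            d_child = d_prefix + dist(query[depth], ch)
            if d_child <= eps:   # otherwise the whole subtree is pruned
                visit(child, depth + 1, prefix + ch, d_child)

    visit(trie, 0, "", 0)
    return sorted(results)


FRAGMENTS = ["AB", "AC", "BB", "CA"]
found = trie_range_search(FRAGMENTS, "AB", 3, d_letter)
print(found)
```

Pruning is safe because each letter distance is non-negative, so the prefix distance can only grow along a path from the root.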

6.3 FSIndex 

FSIndex is an access method for short peptide fragment workloads mainly based 
on two procedures: combinatorial generation and amino acid alphabet reduction. 

For very short fragments (lengths 2-4), the number of all possible fragment
instances is very small (for length 3, $20^3 = 8000$) and almost every fragment
instance generated exists in the dataset. Hence, it is possible to enumerate all
neighbours of a given point in a very efficient and straightforward manner using
digital trees or even hashing. For larger lengths, the number of fragments in a
dataset is generally much smaller than the number of all possible fragments
(Figure 6.1) and generation of neighbours is not feasible. If it were to be attempted,
most of the computation would be spent generating fragments that do not exist
in the dataset. Hence the idea of mapping peptide fragment datasets to smaller,
densely and, as much as possible, uniformly packed spaces where the neighbours
of a query point can be efficiently generated using a combinatorial algorithm.

Partitions of the amino acid alphabet provide the means to achieve the above.
Amino acids can be classified by chemical structure and function into groups
such as hydrophobic, polar, acidic, basic and aromatic (Table 1.1). Such a
classification appears in every undergraduate text in biochemistry and has been
previously used in sequence pattern matching [176]. In general, substitutions between
the members of the same group are more likely to be observed in closely related
proteins than substitutions between amino acids of markedly different properties.
The widely used similarity score matrices such as PAM [45] or BLOSUM [88]
are derived from target frequencies of substitutions and therefore capture these
relationships more precisely.

The required mapping is constructed as follows. Given a set of fragments
of fixed length, $\Sigma^m$, an alphabet partition $\pi_i : \Sigma \to \Sigma_i$ is chosen for each
position $i = 0, 1, \ldots, m - 1$, where $|\Sigma_i| < |\Sigma|$. This induces the mapping
$\pi : \Sigma^m \to \Sigma_0 \times \Sigma_1 \times \cdots \times \Sigma_{m-1}$ where
$\pi(a_0 a_1 \ldots a_{m-1}) = \pi_0(a_0) \pi_1(a_1) \ldots \pi_{m-1}(a_{m-1})$. The
members of $\Sigma_0 \times \Sigma_1 \times \cdots \times \Sigma_{m-1}$ are called bins and the number of bins is denoted
by $N$. The partitions $\pi_i$ are often equal for each $i$. An important consequence
of such a mapping is that distances to bins are easy to compute and can be used as
certification functions.
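The reduction map and the distance from a query to a bin can be sketched as follows. The six-letter alphabet and its partition are those of Figure 6.8; the self-similarity values in `SELF` and the resulting letter quasi-metric are assumptions invented for the example, chosen so that the triangle inequality holds.

```python
# Alphabet reduction of Figure 6.8: {A, B} -> 0, {C, D} -> 1, {E, F} -> 2.
GROUP = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2, "F": 2}

# Hypothetical self-similarity scores s(x, x); illustrative values only.
SELF = {"A": 4, "B": 5, "C": 6, "D": 4, "E": 7, "F": 5}


def letter_dist(x, y):
    """Toy quasi-metric s(x, x) - s(x, y), with s = 1 within a group and -1 across."""
    if x == y:
        return 0
    return SELF[x] - (1 if GROUP[x] == GROUP[y] else -1)


def reduce_fragment(fragment, partitions):
    """The map pi: apply the partition chosen for each position."""
    return tuple(p[a] for p, a in zip(partitions, fragment))


def distance_to_bin(query, bin_, partitions, dist):
    """Smallest distance from the query to any fragment mapped to `bin_`.

    For an l1-type distance the minimum can be taken position by position.
    """
    total = 0
    for q, label, p in zip(query, bin_, partitions):
        total += min(dist(q, a) for a in p if p[a] == label)
    return total


PARTITIONS = [GROUP, GROUP, GROUP]   # the same partition at each position
print(reduce_fragment("ACE", PARTITIONS))
print(distance_to_bin("ACE", (1, 1, 2), PARTITIONS, letter_dist))
```

Because the minimum decomposes over positions, the cost of evaluating the distance to a bin is linear in the fragment length and does not depend on how many fragments the bin contains.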

Remark 6.3.1. Positions in each fragment are zero based, that is, numbered from
0 rather than from 1, because the reference implementation of FSIndex is in the C
programming language [109] where arrays are indexed from 0.



6.3.1 Data structure and construction 

The FSIndex data structure consists of three arrays: frag, bin and lcp. The array
frag contains pointers to each fragment in the dataset and is sorted by bin. The
array bin, of size $N + 2$, is indexed by the rank of each bin and contains the offset
of the start of each bin in frag (the $(N + 1)$-th entry gives the total number of
fragments while the last entry is used solely for index creation). The bin ranking
function $r : \Sigma_0 \times \Sigma_1 \times \cdots \times \Sigma_{m-1} \to \{0, 1, \ldots, N - 1\}$ is defined as follows. For
each $i = 0, 1, \ldots, m - 1$ let $r_i : \Sigma_i \to \{0, 1, \ldots, |\Sigma_i| - 1\}$ be a ranking function
of $\Sigma_i$ and define $\xi_i : \Sigma_i \to \mathbb{N}$ by

$$\xi_i(\sigma) = r_i(\sigma) \prod_{j=i+1}^{m-1} |\Sigma_j|. \tag{6.1}$$

In the case $i = m - 1$ the empty product above is taken to be equal to 1. Then,

$$r(x) = \sum_{i=0}^{m-1} \xi_i(x_i). \tag{6.2}$$
In addition, each bin is sorted in lexicographic order and the value of lcp[i]
provides the length of the longest common prefix between frag[i] and frag[i - 1].
The value of lcp[0] is set to 0. Figure 6.8 depicts an example of the full structure
of an FSIndex.
Figure 6.8: Structure of an FSIndex of a dataset of fragments of length 4 from the
alphabet $\Sigma$ = {A, B, C, D, E, F}. The same alphabet reduction is used at each position, mapping
{A, B} to 0, {C, D} to 1 and {E, F} to 2.



Remark 6.3.2. The arrays frag and lcp are inspired by suffix arrays but the order
of offsets in frag is different because frag is first sorted by bin and then each bin
is sorted in lexicographic order. Sorting frag within each bin and constructing
and storing the lcp array is not strictly necessary and incurs a significant space and
construction time penalty. The benefit is improved search performance for large
bins, compensating for unbounded bin sizes. In effect, each bin is subindexed
using a compact version of a PATRICIA tree.

To construct the FSIndex data structure, any sorting algorithm can be used to
produce the frag array from which the bin and lcp arrays can be easily computed.
Algorithm 6.3.1 outlines the reference implementation.
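The construction can be sketched as follows. This is not the reference implementation (which is in C and is outlined in Algorithm 6.3.1): it is a simplified illustration in which frag holds positions in a Python list, the offsets array has N + 1 entries rather than N + 2 because no scratch entry is needed, and the partition labels are assumed to be 0, 1, ..., |Σ_i| - 1.

```python
# Alphabet reduction of Figure 6.8.
GROUP = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2, "F": 2}


def build_fsindex(fragments, partitions):
    """Return (frag, bin_start, lcp) for equal-length fragments."""
    sizes = [len(set(p.values())) for p in partitions]
    n_bins = 1
    for size in sizes:
        n_bins *= size

    def rank(fragment):
        r = 0
        for p, size, a in zip(partitions, sizes, fragment):
            r = r * size + p[a]
        return r

    ranks = [rank(f) for f in fragments]
    # Sort by bin first and lexicographically within each bin.
    frag = sorted(range(len(fragments)), key=lambda i: (ranks[i], fragments[i]))

    # bin_start[b] is the offset of bin b in frag; bin_start[n_bins] is the total.
    bin_start = [0] * (n_bins + 1)
    for r in ranks:
        bin_start[r + 1] += 1
    for b in range(n_bins):
        bin_start[b + 1] += bin_start[b]

    lcp = [0] * len(frag)
    for i in range(1, len(frag)):
        a, b = fragments[frag[i - 1]], fragments[frag[i]]
        k = 0
        while k < len(a) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return frag, bin_start, lcp


FRAGMENTS = ["AB", "CD", "BA", "AB", "EF", "CA"]
frag, bin_start, lcp = build_fsindex(FRAGMENTS, [GROUP, GROUP])
print(frag, bin_start, lcp)
```

The fragments of bin b are then `frag[bin_start[b]:bin_start[b + 1]]`, so an empty bin costs one array entry and nothing else.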

The space requirement of FSIndex is $\Theta(n + N)$. The exact space and time
complexity of the construction algorithm depends on the sorting algorithm used
for sorting the frag array. If the quicksort [94] algorithm is used (the reference
implementation), the space requirement is $\Theta(n + N)$ and the running time is
O(n + N + n log n) on average and O(n + N + n^2) in the worst case. Using radix sort
[173], the average and worst case running time can both be reduced to O(n + N)
with O(n) (or O(log n)) additional space overhead. Another alternative is to use
heapsort [211] to sort the frag array with the time complexity O(n log n + N)
but no additional space overhead.

6.3.2 Search 

Search using FSIndex is based on traversal of implicit trees whose nodes are as- 
sociated with reduced fragments (bins). 

Definition 6.3.3. Let $u = u_0 u_1 \ldots u_{m-1} \in \Sigma_0 \times \Sigma_1 \times \cdots \times \Sigma_{m-1}$. For any
$k = 0, 1, \ldots, m - 1$ and $\sigma \in \Sigma_k$, denote by $u(k, \sigma)$ the sequence
$u_0 \ldots u_{k-1} \sigma u_{k+1} \ldots u_{m-1}$.

Let $i = 0, 1, \ldots, m - 1$. Denote by $T_{u,i}$ the tree having the root $u$ connected
to the subtrees $T_{u(k,\sigma),k+1}$ for all $k = i, i + 1, \ldots, m - 1$ and $\sigma \in \Sigma_k \setminus \{u_k\}$, and
by $T_u$ the tree $T_{u,0}$. △

The trees $T_{u,i}$ are connected and unbalanced and can be shown to have depth
$m - i$ while the root has the degree $\sum_{k=i}^{m-1} (|\Sigma_k| - 1)$. The tree topology is clearly
independent of the choice of $u$. If $|\Sigma_0| = |\Sigma_1| = \ldots = |\Sigma_{m-1}| = K$, $T_u$ is
isomorphic to the multinomial tree of order $(m, K)$. If $K = 2$, such a tree is called
the binomial tree of order $m$. An example is shown in Figure 6.9.

The following Proposition is easily established. 

Proposition 6.3.4. Let $\Sigma_i$, $i = 0, 1, \ldots, m-1$, be finite sets and 
$u \in \Sigma_0 \times \Sigma_1 \times \ldots \times \Sigma_{m-1}$. Then there exists a bijection between the nodes of $T_u$ 
and the set $\Sigma_0 \times \Sigma_1 \times \ldots \times \Sigma_{m-1}$. □ 

Figure 6.9: An example of $T_u$ where $u = aaa \in \Sigma_0 \times \Sigma_1 \times \Sigma_2$, $\Sigma_0 = \{a, b\}$, 
$\Sigma_1 = \{a, b, c, d\}$, $\Sigma_2 = \{a, b, c\}$. [Tree diagram not reproduced; the root is aaa and 
the deepest nodes are bbb, bbc, bcb, bcc, bdb and bdc.] 
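
Proposition 6.3.4 can be checked by brute force on the example of Figure 6.9. The following Python sketch is illustrative only and is not part of the reference implementation; the function names children and walk are hypothetical. It enumerates the implicit tree by replacing, at each node, one letter at a position that is still free, and compares the node labels with the full product set.

```python
from itertools import product

def children(u, i, alphabets):
    """Yield (child, next_i) for node u of T_{u,i}: the letter at a position
    k >= i is replaced by a different letter, and the child roots T_{child,k+1}."""
    for k in range(i, len(u)):
        for sigma in alphabets[k]:
            if sigma != u[k]:
                yield u[:k] + sigma + u[k + 1:], k + 1

def walk(u, i, alphabets, depth=0):
    """Depth-first enumeration of (node, depth) pairs of the implicit tree."""
    yield u, depth
    for v, j in children(u, i, alphabets):
        yield from walk(v, j, alphabets, depth + 1)

# Alphabets and root taken from Figure 6.9.
alphabets = ["ab", "abcd", "abc"]
nodes = list(walk("aaa", 0, alphabets))
labels = [v for v, _ in nodes]
all_sequences = {"".join(p) for p in product(*alphabets)}

root_degree = sum(1 for _ in children("aaa", 0, alphabets))
max_depth = max(d for _, d in nodes)
deepest = sorted(v for v, d in nodes if d == max_depth)
```

Every sequence is reached exactly once because the positions at which it differs from the root are changed in increasing order, which is the bijection claimed by the Proposition.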



Retrieval of a quasi-metric range query $B_\varepsilon(\omega)$ using the implicit tree structure 
is conceptually straightforward. Given a query point $\omega$ and the radius $\varepsilon$, map $\omega$ to 
its bin $\pi(\omega)$ and traverse the tree $T_{\pi(\omega)}$ from the root. At each node $u$, calculate 
the distance $d(\omega, u)$ and prune the subtree rooted at $u$ if $d(\omega, u) > \varepsilon$. For every 
visited node which is not pruned, calculate the distance to each fragment in the 
associated bin and collect all the fragments whose distance from $\omega$ is not greater 
than $\varepsilon$. 
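
The traversal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the four-letter alphabet, the distance table D, the two-class partition PARTS and all function names are invented for the example, and bins are kept in a dictionary rather than in the bin and frag arrays.

```python
from itertools import product

# Toy asymmetric letter distances with zero diagonal (illustrative values only).
D = {
    "a": {"a": 0, "b": 1, "c": 3, "d": 4},
    "b": {"a": 2, "b": 0, "c": 3, "d": 3},
    "c": {"a": 4, "b": 3, "c": 0, "d": 1},
    "d": {"a": 3, "b": 4, "c": 2, "d": 0},
}
PARTS = ["ab", "cd"]   # hypothetical alphabet partition, same at every position
M = 3                  # fragment length

def project(x):
    """pi: map a fragment to its bin, a tuple of class indices."""
    return tuple(next(i for i, p in enumerate(PARTS) if ch in p) for ch in x)

def dist(w, x):
    """l1-type sum of letter distances from the query w to x."""
    return sum(D[a][b] for a, b in zip(w, x))

def build_bins(dataset):
    bins = {}
    for x in dataset:
        bins.setdefault(project(x), []).append(x)
    return bins

def range_search(bins, w, eps):
    """Return (hits, visited bins) for all x with dist(w, x) <= eps."""
    # to_class[j][c]: smallest distance from the query letter w[j] to class c.
    to_class = [[min(D[w[j]][b] for b in PARTS[c]) for c in range(len(PARTS))]
                for j in range(M)]
    hits, visited = [], []

    def process_bin(u):
        visited.append(u)
        hits.extend(x for x in bins.get(u, []) if dist(w, x) <= eps)

    def check_node(u, acc, i):
        for j in range(M - 1, i - 1, -1):
            for c in range(len(PARTS)):
                if c == u[j]:
                    continue
                e = acc + to_class[j][c]
                if e <= eps:           # otherwise the whole subtree is pruned
                    v = u[:j] + (c,) + u[j + 1:]
                    process_bin(v)
                    check_node(v, e, j + 1)

    root = project(w)
    process_bin(root)
    check_node(root, 0, 0)
    return sorted(hits), visited

dataset = ["".join(p) for p in product("abcd", repeat=M)]
bins = build_bins(dataset)
hits, visited = range_search(bins, "abc", 4)
brute = sorted(x for x in dataset if dist("abc", x) <= 4)
```

Pruning is safe here because the accumulated class distances are a lower bound on the distance from the query to any fragment stored in the bin.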

The indexing scheme providing the access method described above can be 
described as a query partitioning indexing scheme (Subsection 5.6.2) in which the 
range query workload over $\Sigma^m$ and $X$ is partitioned into a union of valuation 
workloads, one for each query point $\omega$, with the valuation $d_\omega(x) = d(\omega, x)$. Each 
valuation workload is associated with a valuation indexing scheme, defined as 
follows. The set of blocks is $\Sigma_0 \times \Sigma_1 \times \ldots \times \Sigma_{m-1}$ and the tree $T$ consists of the 
tree $T_{\pi(\omega)}$ where a leaf node corresponding to the same reduced sequence is attached 
to each node. The function $g : T \to \mathbb{R}$, increasing on $T$, is given by¹ 

$g(t) = d(\omega, t) = \min_{y \in t} d(\omega, y)$. 

It is clear that each such scheme is indeed a valuation indexing scheme. Proposition 6.3.4 
ensures that the number of leaf nodes is $N$, while $g$ is increasing on $T$ because each 
child node is obtained by replacing one letter from the parent with another, different 
letter, an operation which increases the distance. Therefore, by Theorem 5.5.4, each 
valuation indexing scheme is consistent, and it follows that the query partitioning 
indexing scheme over the whole workload is also consistent. 

Unlike most published metric indexing schemes mentioned in Chapter 5, FSIndex 
does not have a balanced tree. Therefore, the expected average and worst-case 
search time complexity is $O(n + K)$; the overhead is proportional to $K$, the 
number of inner nodes. So, based on these considerations, FSIndex is not scalable 
for queries of a fixed radius. However, the performance can be to a large 
extent controlled by the choice of alphabet partitions and hence some scalability 
can be achieved by using more partitions for larger datasets in order to reduce the 
scanning time while incurring some additional overhead. 

6.3.3 Implementation 

Descriptions of FSIndex algorithms in this section are based on the reference 
implementation developed in the C programming language [109] (some optimisations 
are omitted for clarity). Table 6.1 shows the descriptions of all global variables 
and functions used. 

¹This is a slight abuse of notation because the tree T now has two distinct copies of each bin: 
one as an inner node and one as a leaf node attached to the inner node. The context should be clear 
nevertheless. 

Symbol          Description
X               Fragment dataset
n               Size of X (usually not known exactly beforehand)
m               Fragment length
$\Sigma_j$      Reduced alphabet at j-th position
$\pi_j$         Projection at j-th position
$r_j$           Integer value of a letter of reduced alphabet at j-th position
$\pi$           Projection function: maps each fragment into its bin
N               Total number of bins, $N = \prod_{j=0}^{m-1} |\Sigma_j|$
r               Bin ranking function: index into bin array
u               Index of a bin, u = r(x) where x is a bin
$\omega$        Query fragment
d               Distance function
$\varepsilon$   Search radius
k               Number of nearest neighbours to retrieve
CD              Cumulative distance array of length m + 1 used for processing each bin
HL              List of search results (hits)
PQ              Priority queue for kNN search


Table 6.1: Variables and functions of FSIndex creation and search algorithms. 
Construction 

The construction algorithm (Algorithm 6.3.1) is closely related to counting sort 
[173]. It makes three passes over data fragments: to count the number of fragments 
in each bin, to insert the fragments into the frag array and to compute the 
lcp array. It allocates the memory for the arrays after counting. 
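
The three passes can be sketched in Python as below. This is a simplified illustration, not the reference C implementation: it assumes a caller-supplied function rank standing for the composition of r and the projection, uses a separate cursor array instead of the shifted indices of Algorithm 6.3.1, and relies on the built-in sort inside each bin. The names create_fsindex, CLASS and rank are hypothetical.

```python
def create_fsindex(fragments, num_bins, rank):
    """Counting-sort style construction of the (start, frag, lcp) arrays.
    rank(s) must return the bin index of fragment s, in range(num_bins)."""
    fragments = list(fragments)
    start = [0] * (num_bins + 1)
    for s in fragments:                      # pass 1: count bin sizes
        start[rank(s) + 1] += 1
    for i in range(1, num_bins + 1):         # prefix sums: start[i] = first slot of bin i
        start[i] += start[i - 1]
    frag = [None] * len(fragments)
    cursor = start[:-1]                      # next free slot of each bin (a copy)
    for s in fragments:                      # pass 2: place fragments into bins
        i = rank(s)
        frag[cursor[i]] = s
        cursor[i] += 1
    for i in range(num_bins):                # sort the fragments inside each bin
        frag[start[i]:start[i + 1]] = sorted(frag[start[i]:start[i + 1]])
    lcp = [0] * len(frag)                    # pass 3: common prefix with predecessor
    for j in range(1, len(frag)):
        s, t = frag[j - 1], frag[j]
        k = 0
        while k < min(len(s), len(t)) and s[k] == t[k]:
            k += 1
        lcp[j] = k
    return start, frag, lcp

# Toy reduced alphabet: {a, b} -> 0, {c, d} -> 1; bins are ranked as binary numbers.
CLASS = {"a": 0, "b": 0, "c": 1, "d": 1}

def rank(s):
    r = 0
    for ch in s:
        r = 2 * r + CLASS[ch]
    return r

start, frag, lcp = create_fsindex(["ca", "ab", "bd", "aa", "ab", "dc"], 4, rank)
```

Bin i then occupies the slice frag[start[i]:start[i + 1]], so empty bins cost one integer each.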

The fragment dataset is in practice always obtained from a full sequence dataset 
by iterating over all subfragments of length m from each sequence and it is often 
necessary to verify each fragment and reject those that contain non-standard letters 
such as 'X', 'B' or 'Z' that do not represent actual amino acids and violate 
the triangle inequality for the score matrices. Therefore, the true number of data 
points is not known before the first pass through the dataset. 
Search 

Range search (Algorithm 6.3.2) makes a recursive, depth-first traversal of the 
implicit tree implemented in the function CheckNode (Algorithm 6.3.3). The 
function ProcessBin (Algorithm 6.3.4) scans each bin associated with an inner 
node not pruned, using the lcp array in order to reduce the number of computations 
necessary to calculate distances to each member of the bin.² The function 
InsertHit (omitted in the case of range search) inserts the neighbour into the list 
of search results. 
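
The saving obtained from the lcp array can be illustrated with the following Python sketch. It is a simplified variant written for this illustration and not a transcription of Algorithm 6.3.4: instead of looking ahead to the next fragment, it remembers how many entries of the cumulative distance array are still valid and stops reading a fragment as soon as the partial sum exceeds the radius. The name scan_bin and the 0/1 letter distance are assumptions.

```python
def scan_bin(frag, lcp, lo, hi, w, d, eps):
    """Scan the sorted slice frag[lo:hi], where lcp[t] is the length of the common
    prefix of frag[t-1] and frag[t]. Returns (hits, letters_read), hits being
    (fragment, distance) pairs with distance <= eps. Letter distances must be >= 0."""
    m = len(w)
    cd = [0] * (m + 1)      # cd[j]: distance accumulated over the first j positions
    valid = 0               # cd[0..valid] is correct for the current fragment
    hits, letters_read = [], 0
    for t in range(lo, hi):
        s = frag[t]
        # Prefix sums survive only over the prefix shared with the previous fragment.
        valid = min(valid, lcp[t]) if t > lo else 0
        j = valid
        while j < m and cd[j] <= eps:
            cd[j + 1] = cd[j] + d(w[j], s[j])
            letters_read += 1
            j += 1
        valid = j
        if j == m and cd[m] <= eps:
            hits.append((s, cd[m]))
    return hits, letters_read

def d(a, b):
    """Toy letter distance: 0 for equal letters, 1 otherwise."""
    return 0 if a == b else 1

frag = ["aaaa", "aaab", "aabb", "abbb", "bbbb"]
lcp = [0, 3, 2, 1, 0]
hits, letters_read = scan_bin(frag, lcp, 0, len(frag), "aaaa", d, 1)
```

In this example 11 letters are read instead of the 20 a plain sequential scan would read, which is the kind of effect reported as the percentage of residues scanned.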

The search algorithm computes and stores the values of $d(\omega_k, \sigma)$, 
$\min \{d(\omega_k, \sigma) \mid \sigma \in \Sigma_k \setminus \{\pi_k(\omega_k)\}\}$ and $-r_k(\pi_k(\omega_k)) + r_k(\sigma)$ for all $k$ and all $\sigma$ 
before tree traversal so that the CheckNode function uses a table lookup. 

The kNN search algorithms use branch-and-bound [41, 93] traversal, which involves 
initially setting the radius $\varepsilon$ to a very large number ($+\infty$), inserting the first k data 
points encountered into the list of hits and then setting $\varepsilon$ to be the largest distance 
of a hit from a query. From then on, if a point closer to the query than the farthest 
hit is found, it is inserted in the list and the previous farthest hit is removed. 
Eventually, the current search radius is reduced to the exact radius necessary to retrieve 
k nearest neighbours. 

The branch-and-bound procedure is implemented using a priority queue (heap) 
which returns the farthest data point in the list of hits (Table 6.2 outlines the 
operations on priority queue). Most of the code for range search can be reused: it is 
only necessary to use a different InsertHit function involving a priority queue 
(Algorithm 6.3.6) and to initialise the priority queue in the main search function 
(Algorithm 6.3.5). Algorithm 6.3.6 uses the final list of results HL as an auxiliary 
list to store those neighbours that have the same distance from the query as the 
farthest point in the priority queue. It copies the hits in the priority queue into HL 
after finishing the tree traversal. 
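
The bookkeeping described in the last two paragraphs can be sketched in Python with the standard heapq module. This is an illustration under assumptions rather than the reference code: the class name KnnCollector is hypothetical, the max-heap is simulated by negating distances, and a hit evicted from the heap is kept in the auxiliary list when it ties with the new farthest hit.

```python
import heapq

class KnnCollector:
    """Keeps the k closest hits seen so far, plus hits tied with the farthest of them."""

    def __init__(self, k):
        self.k = k
        self.eps = float("inf")   # current search radius
        self.heap = []            # max-heap via negated distances: (-dist, item)
        self.ties = []            # extra hits at distance exactly eps

    def insert(self, item, dist):
        if dist > self.eps:       # outside the current radius, nothing to do
            return
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (-dist, item))
            if len(self.heap) == self.k:
                self.eps = -self.heap[0][0]
        elif dist < self.eps:
            old_neg, old_item = heapq.heapreplace(self.heap, (-dist, item))
            self.eps = -self.heap[0][0]
            if -old_neg == self.eps:
                self.ties.append((old_item, -old_neg))  # evicted hit still ties
            else:
                self.ties.clear()                       # radius shrank, old ties drop out
        else:                     # dist == eps: a tie with the farthest hit
            self.ties.append((item, dist))

    def results(self):
        found = [(item, -neg) for neg, item in self.heap] + self.ties
        return sorted(found, key=lambda p: (p[1], p[0]))

collector = KnnCollector(2)
for item, dist in [("p", 5), ("q", 3), ("r", 3), ("s", 4), ("t", 3), ("u", 1)]:
    collector.insert(item, dist)
result = collector.results()
```

With k = 2 the final radius is 3 and four hits are returned, because three data points share the distance of the second nearest neighbour.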

²Conceptually, Algorithm 6.3.4 is equivalent to depth-first traversal of a compact form of a 
PATRICIA tree for the set of fragments in the bin. 

Algorithm 6.3.1: CreateFSIndex(X, m, N, π, r) 

    bin ← AllocateMemory(N + 2) 
    bin[0] ← 0, bin[1] ← 0 
    n ← 0 
    comment: Count bin sizes 
    for each s ∈ X 
        do  i ← r(π(s)) 
            bin[i + 2] ← bin[i + 2] + 1 
            n ← n + 1 
    for i ← 2 to N + 1 
        do  bin[i] ← bin[i] + bin[i − 1] 
    comment: Insert fragments into bins 
    frag ← AllocateMemory(n) 
    for each s ∈ X 
        do  i ← r(π(s)) 
            frag[bin[i + 1]] ← s 
            bin[i + 1] ← bin[i + 1] + 1 
    comment: Calculate longest common prefixes 
    for i ← 0 to N − 1 
        do  Quicksort(frag[bin[i] : bin[i + 1]]) 
    lcp ← AllocateMemory(n) 
    lcp[0] ← 0 
    for j ← 1 to n − 1 
        do  k ← 0, s ← frag[j − 1], t ← frag[j] 
            while s_k = t_k 
                do  k ← k + 1 
            lcp[j] ← k 
    return (bin, frag, lcp) 

Algorithm 6.3.2: RangeSearch(ω, d, ε) 

    comment: Recursive tree traversal 
    global bin, frag, lcp, π, r, HL, CD 
    Initialise list of hits HL 
    Initialise cumulative distances CD, CD[0] ← 0 
    u ← r(π(ω)) 
    ProcessBin(u) 
    CheckNode(u, 0, 0) 
    return (HL) 

Algorithm 6.3.3: CheckNode(u, D, i) 

    comment: Recursive tree traversal 
    global d, ε, r_j, π_j 
    for j ← m − 1 downto i 
        do  if D + min {d(ω_j, σ) | σ ∈ Σ_j \ {π_j(ω_j)}} ≤ ε 
                then for each σ ∈ Σ_j \ {π_j(ω_j)} 
                         do  E ← D + d(ω_j, σ) 
                             if E ≤ ε 
                                 then v ← u − r_j(π_j(ω_j)) + r_j(σ) 
                                      ProcessBin(v) 
                                      CheckNode(v, E, j + 1) 



Algorithm 6.3.4: ProcessBin(u) 

    comment: Sequentially scan all entries. 
    global d, ε, HL, bin, frag, lcp, CD 
    n ← bin[u + 1] − bin[u] 
    if n = 0 
        then return 
    for i ← 0 to n − 1 
        do  s ← frag[u + i] 
            for j ← lcp[u + i] to lcp[u + i + 1] − 1 
                do  CD[j + 1] ← CD[j] + d(ω_j, s_j) 
            if CD[lcp[u + i + 1]] ≤ ε 
                then for j ← lcp[u + i + 1] to m − 1 
                         do  CD[j + 1] ← CD[j] + d(ω_j, s_j) 
                     if CD[m] ≤ ε 
                         then InsertHit(HL, s, CD[m]) 

The performance of the branch-and-bound algorithm depends on the order of 
nodes visited: it is a great advantage if the nodes containing data points closest 
to the query are visited first, so that the bounding radius becomes small early on. 
A frequently used solution [41, 93] is to traverse the tree breadth-first, keeping 
the nodes to be visited in a second priority queue, where the priority of a node is 
given by the upper bound of the distance of its covering set from the query. 

The second priority queue is not used for the FSIndex based kNN search. 
Since the implicit tree is heavily unbalanced, the branches with smallest depth 
are visited first with a similar effect without the overhead of the second priority 
queue. The visiting order of nodes is ensured in the outer loop of the CheckNode 
function where the index j starts at m − 1, decreasing to i (Algorithm 6.3.3). 
Since the order does not affect the range search performance, the same code can 
be used for range search. 



Algorithm 6.3.5: KNNSearch(ω, d, k) 

    comment: Recursive tree traversal 
    global ε, bin, frag, lcp, ω, π, r, HL, CD 
    Initialise list of hits HL 
    Initialise cumulative distances CD, CD[0] ← 0 
    Initialise priority queue PQ 
    u ← r(π(ω)) 
    ε ← ∞ 
    ProcessBin(u) 
    CheckNode(u, 0, 0) 
    Insert all hits from PQ to HL 
    return (HL) 

Operation          Description
PQ.Size()          number of items in the priority queue PQ
PQ.Insert(s, p)    inserts item s with priority p
PQ.Peek()          retrieves the item with highest priority and its priority
PQ.Remove()        retrieves the item with highest priority and its priority
                   and removes it from the queue

Table 6.2: Priority queue operations. 

Algorithm 6.3.6: InsertHit(HL, s, dist) 

    comment: Hit insertion for kNN search. 
    global k, ε, PQ 
    if PQ.Size() < k 
        then PQ.Insert(s, dist) 
             if PQ.Size() = k 
                 then s1, dist1 ← PQ.Peek() 
                      ε ← dist1 
        else if dist < ε 
            then s1, dist1 ← PQ.Remove() 
                 PQ.Insert(s, dist) 
                 s2, dist2 ← PQ.Peek() 
                 ε ← dist2 
                 if dist1 = dist2 
                     then HL.Insert(s1, dist1) 
                     else HL.Clear() 
            else HL.Insert(s, dist) 

6.3.4 Extensions 

FSIndex as described so far provides an access method for workloads of fragments 
of fixed length with quasi-metric similarity measures. However, with minor 
modifications it can be extended to fragment (suffix) datasets of arbitrary length and 
almost arbitrary similarity measures. 

Arbitrary fragment lengths 

In most practical situations, fragment datasets are datasets of suffixes of full 
sequences. The FSIndex structure as is can be used without modifications for 
answering queries longer than m, the original length: each fragment of length m is 
a prefix of a suffix of length m' where m' > m. To search with a query of length 
m', traverse the search tree using the first m positions and sequentially scan all 
the bins retrieved, using all m' positions to calculate the distance. If m' > m, the 
few fragments of length m at the end of each full sequence can be identified and 
ignored at the sequential scan step. 

Similarly, FSIndex can be used to answer queries centered on fragments of 
length m'' where m'' < m. At the construction step, insert all suffixes, including 
those of length less than m, into the index by mapping each fragment 
$x$ such that $|x| = m'' < m$ into the bin $\pi_1(x_1)\pi_2(x_2) \ldots \pi_{m''}(x_{m''})\sigma_{m''+1} \ldots \sigma_m$, 
where $\sigma_{m''+1}, \ldots, \sigma_m$ are chosen so that 
$r_{m''+1}(\sigma_{m''+1}) = r_{m''+2}(\sigma_{m''+2}) = \ldots = r_m(\sigma_m) = 0$. 

To answer a query centered on $\omega$ such that $|\omega| = m''$, traverse the search tree 
up to the depth m'' and sequentially scan all the bins attached to subtrees rooted at 
the accepted nodes, using the first m'' positions to calculate the distance. The ranking 
function given by Equations 6.1 and 6.2 ensures that the bins that are the 
children of a given node are adjacent in the frag array. 

Arbitrary similarity measures 

FSIndex does not directly depend on a quasi-metric: it is constructed solely from 
alphabet partitions. While index performance strongly depends on the way the 
distance agrees with partitions, the same index can be used for any distance which 
is an $\ell_1$-type sum. It is possible to make even further generalisations. 

Let $i = 0, 1, \ldots, m-1$ and suppose $\Sigma_i$ are finite alphabets and $f_i$ are arbitrary 
functions $\Sigma_i \to \mathbb{R}$. Suppose $F : \Sigma_0 \times \ldots \times \Sigma_{m-1} \to \mathbb{R}$ is given by 
$F(x) = \sum_{i=0}^{m-1} f_i(x_i)$. Let $c_i = \min_{a \in \Sigma_i} f_i(a)$, $z_i = \arg\min_{a \in \Sigma_i} f_i(a)$ and let $z$ denote the 
sequence $z_0 z_1 \ldots z_{m-1} \in \Sigma_0 \times \ldots \times \Sigma_{m-1}$. It is clear that the function $F_0$ given by 
$F_0(x) = F(x) - \sum_{i=0}^{m-1} c_i$ is increasing on the tree $T_z$ and therefore the FSIndex 
can be used to answer queries for any valuation workload or a union of valuation 
workloads. Important biological cases include PSSM or profile based similarities, 
which are exactly $\ell_1$-type sums of real-valued functions at each position, as well 
as any score matrix based similarity, whether or not the triangle inequality on the 
alphabet is satisfied. Note that the above statement applies only to consistency of 
the indexing scheme and not to the computational efficiency of query retrieval. 
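
The claim that the shifted score is increasing along the tree rooted at the minimising sequence can be checked numerically. The Python sketch below is illustrative only: the per-position score tables are random numbers, not a real PSSM, and the helper names shifted_score and tree_edges are hypothetical.

```python
import random

def shifted_score(f):
    """f[i] maps each letter at position i to a real score. Returns (z, F0), where
    z takes a minimiser at every position and F0(x) = F(x) - sum_i min f_i."""
    z = "".join(min(fi, key=fi.get) for fi in f)
    c = [min(fi.values()) for fi in f]
    def F0(x):
        return sum(fi[a] - ci for fi, a, ci in zip(f, x, c))
    return z, F0

def tree_edges(u, i, alphabets):
    """Yield the (parent, child) edges of the implicit tree T_{u,i}."""
    for k in range(i, len(u)):
        for sigma in alphabets[k]:
            if sigma != u[k]:
                v = u[:k] + sigma + u[k + 1:]
                yield u, v
                yield from tree_edges(v, k + 1, alphabets)

random.seed(0)
alphabets = ["abc", "ab", "abcd"]
# Arbitrary real-valued scores, possibly negative, one table per position.
f = [{a: random.uniform(-5, 5) for a in sigma} for sigma in alphabets]
z, F0 = shifted_score(f)
edges = list(tree_edges(z, 0, alphabets))
monotone = all(F0(parent) <= F0(child) for parent, child in edges)
```

Each edge replaces a minimising letter by another letter at one position, so the corresponding term can only grow; this is why no triangle inequality is needed for consistency.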



6.4 Experimental Results 

This section describes the experiments on actual fragment datasets carried out to 
evaluate the performance of FSIndex. Three main classes of tests were conducted, 
investigating general performance, effects of similarity measures and scalability. 
The final set of experiments compares the performance of FSIndex to the performances 
of suffix arrays, M-tree and mvp-tree. 

Each experiment consisted of 5000 searches using randomly generated queries 
(Subsection 6.1.3). The main measures of performance are the number of bins and 
dataset fragments scanned in order to retrieve k nearest neighbours. The principal 
reason for expressing the results in terms of the number of nearest neighbours 
retrieved rather than the radius was that it allows comparison across different 
indexing schemes, datasets and similarity measures. Furthermore, most existing protein 
datasets are strongly non-homogeneous and the number of points scanned in order 
to retrieve a range query for a fixed radius varies greatly compared to the number 
of points scanned in order to retrieve a fixed number of nearest neighbours. 
Nevertheless, most experiments involve range search algorithms, because they are 
generally more efficient and because in some cases no kNN implementation was 
available. 

Other performance criteria were total running time (only shown where all 
experiments compared were performed on the same machine with similar loads) 
and the percentage of residues (letters) scanned out of the total number of residues 
in all scanned fragments. The latter statistic measures the effect of sub-indexing 
each bin using the suffix-array-like structure, which involves 'partially' scanning 
each fragment with the help of the lcp array. The final statistic is access overhead, 
discussed in Section 5.7. 

The obvious reference algorithm, which was not run due to excessive running 
times for large datasets, is sequential scan of all fragments in a dataset. Most of 
the experiments were run on a Sun Fire™ 280R server (733 MHz CPU). 

6.4.1 Datasets and indexes 

Experiments investigating general performance and effect of different similarity 
measures used overlapping protein fragment datasets derived from the SwissProt 
Release 43.2 of April 2004. Scalability experiments used, in addition to SwissProt, 
the datasets nr018K, nr036K, nr072K and nr288K, obtained by randomly 
sampling 18, 36, 72 and 288 thousand sequences respectively from the nr 
dataset (SwissProt fills the gap because it contains about 150,000 sequences). 
The experiments comparing FSIndex to suffix arrays and mvp-tree used only the 
nr018K dataset. 

Table 6.3 describes the instances of FSIndex used in the evaluations. Two 
instances (SPNA09 and SPNB09) were based on partitions that are not equal at 
all positions while the remainder had the same partitions at all positions. 



Index     Dataset     Partitions                                      Fragments   Bins
SPEQ06    SwissProt   T, SA, N, ILV, M, KR, DE, Q, WF, Y, H, G, P, C  53486349    7529536
SPEQ09    SwissProt   TSAN, ILVM, KR, DEQ, WFYH, GPC                  53478888    10077696
SPEQ12    SwissProt   TSAN, ILVM, KRDEQ, WFYHGPC                      53472161    16777216
nr01809   nr018K      TSAN, ILVM, KR, DEQ, WFYH, GPC                  6005750     10077696
nr03609   nr036K      TSAN, ILVM, KR, DEQ, WFYH, GPC                  11911191    10077696
nr07209   nr072K      TSAN, ILVM, KR, DEQ, WFYH, GPC                  23878523    10077696
nr28809   nr288K      TSAN, ILVM, KR, DEQ, WFYH, GPC                  95593618    10077696
SPNA09    SwissProt   KR, Q, E, D, N, T, SA, G, H, W, Y, F, P, C, ILV, M   53478888   10483200
                      KR, Q, ED, N, T, SA, G, HW, YF, P, C, ILV, M
                      KR, QED, N, TSA, G, HW, YF, P, C, ILVM
                      KR, QEDN, TSA, G, HWYF, PC, ILVM
                      KR, QEDN, TSA, G, HWYFPC, ILVM
                      KR, QEDN, TSAG, HWYFPC, ILVM
                      KRQEDN, TSAG, HWYFPC, ILVM
                      KRQEDN, TSAG, HWYFPCILVM
                      KRQEDNTSAG, HWYFPCILVM
SPNB09    SwissProt   KR, QEDN, TSA, G, HWYF, PC, ILVM                53476582    8643600
                      KR, QEDN, TSA, G, HWYF, PC, ILVM
                      KR, QEDN, TSA, G, HWYF, PC, ILVM
                      KR, QEDN, TSA, G, HWYF, PC, ILVM
                      KR, QEDN, TSA, G, HWYFPC, ILVM
                      KR, QEDN, TSAG, HWYFPC, ILVM
                      KR, QEDN, TSAG, HWYFPC, ILVM
                      KRQEDN, TSAG, HWYFPC, ILVM
                      KRQEDN, TSAG, HWYFPCILVM
                      KRQEDNTSAG, HWYFPCILVM


Table 6.3: Instances of FSIndex used in experimental evaluations. The last two digits 
of the index name denote the length of reduced fragments. The indexes SPNA09 and 
SPNB09 use non-equal partitions at different positions (all shown) while the remainder 
were constructed using one partition for all positions (only one shown). 

The choice of amino acid alphabet partitions was mainly a result of practical 
considerations based on the BLOSUM62 quasi-metric (Figure 6.10). It was not 
possible to partition the alphabet in a way that all distances within partitions are 
smaller than distances between partitions, and hence the primary criterion was to have as 
high a lower bound as possible on distances from any possible query point to any partition but 
its own. The additional criterion was to balance to the greatest possible extent 
the sizes of bins and to avoid having too many empty bins, which would introduce 
large overhead. Therefore, the number of partitions per residue was decreased 
with fragment length by amalgamating 'close' partitions. Some amino acids having 
very small overall frequencies, such as tryptophan ('W') and cysteine ('C'), 
were in some cases clustered together in order to reduce the total number of 
partitions, even though their distances from and to any other amino acid are very 
large. 
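
The primary criterion can be made concrete with a small Python sketch. It is an illustration only: the distances below are the entries of Figure 6.10 for the six letters I, V, L, M, K and R, the second partition is a deliberately poor alternative invented for contrast, and the function names separation and max_within are hypothetical.

```python
# Off-diagonal entries of the BLOSUM62 quasi-metric (Figure 6.10), rows to columns.
DIST = {
    "I": {"V": 1, "L": 2, "M": 4, "K": 8, "R": 8},
    "V": {"I": 1, "L": 3, "M": 4, "K": 7, "R": 8},
    "L": {"I": 2, "V": 3, "M": 3, "K": 7, "R": 7},
    "M": {"I": 3, "V": 3, "L": 2, "K": 6, "R": 6},
    "K": {"I": 7, "V": 6, "L": 6, "M": 6, "R": 3},
    "R": {"I": 7, "V": 7, "L": 6, "M": 6, "K": 3},
}

def separation(dist, partition):
    """Smallest distance from any letter to a letter outside its own class."""
    own = {a: cls for cls in partition for a in cls}
    return min(dist[a][b] for a in dist for b in dist[a] if b not in own[a])

def max_within(dist, partition):
    """Largest distance between two different letters of the same class."""
    return max(dist[a][b] for cls in partition for a in cls for b in cls if a != b)

used = ["ILVM", "KR"]          # classes appearing in the length 9 partition
alternative = ["ILV", "MKR"]   # hypothetical partition, for comparison only

sep_used, sep_alt = separation(DIST, used), separation(DIST, alternative)
within_used, within_alt = max_within(DIST, used), max_within(DIST, alternative)
```

A larger separation lets the traversal prune a neighbouring bin at a larger search radius, which is why the classes ILVM and KR are preferable to the alternative on this subset.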





     T  S  A  N  I  V  L  M  K  R  D  E  Q  W  F  Y  H  G  P  C
T    0  3  4  6  5  4  5  6  6  6  7  6  6 13  8  9 10  8  8 10
S    4  0  3  5  6  6  6  6  5  6  6  5  5 14  8  9  9  6  8 10
A    5  3  0  8  5  4  5  6  6  6  8  6  6 14  8  9 10  6  8  9
N    5  3  6  0  7  7  7  7  5  5  5  5  5 15  9  9  7  6  9 12
I    6  6  5  9  0  1  2  4  8  8  9  8  8 14  6  8 11 10 10 10
V    5  6  4  9  1  0  3  4  7  8  9  7  7 14  7  8 11  9  9 10
L    6  6  5  9  2  3  0  3  7  7 10  8  7 13  6  8 11 10 10 10
M    6  5  5  8  3  3  2  0  6  6  9  7  5 12  6  8 10  9  9 10
K    6  4  5  6  7  6  6  6  0  3  7  4  4 14  9  9  9  8  8 12
R    6  5  5  6  7  7  6  6  3  0  8  5  4 14  9  9  8  8  9 12
D    6  4  6  5  7  7  8  8  6  7  0  3  5 15  9 10  9  7  8 12
E    6  4  5  6  7  6  7  7  4  5  4  0  3 14  9  9  8  8  8 13
Q    6  4  5  6  7  6  6  5  4  4  6  3  0 13  9  8  8  8  8 12
W    7  7  7 10  7  7  6  6  8  8 10  8  7  0  5  5 10  8 11 11
F    7  6  6  9  4  5  4  5  8  8  9  8  8 10  0  4  9  9 11 11
Y    7  6  6  8  5  5  5  6  7  7  9  7  6  9  3  0  6  9 10 11
H    7  5  6  5  7  7  7  7  6  5  7  5  5 13  7  5  0  8  9 12
G    7  4  4  6  8  7  8  8  7  7  7  7  7 13  9 10 10  0  9 12
P    6  5  5  8  7  6  7  7  6  7  7  6  6 15 10 10 10  8  0 12
C    6  5  4  9  5  5  5  6  8  8  9  9  8 13  8  9 11  9 10  0



Figure 6.10: BLOSUM62 quasi-metric. Distances within members of an alphabet 
partition used for constructing an index for fragments of length 9 used in experiments are 
greyed. 

The alphabet partitions from Table 6.3 agree with the 'biochemical 
intuition' (i.e. the classification from the Table [TTTI based on chemical properties 
of amino acids). For example, the clusters outlined in Figure 6.10 used for 
fragments of length 9 approximately correspond to polar uncharged, hydrophobic, 
basic, acidic, aromatic and 'other' amino acids. The partition used for the 
fragments of length 12 is obtained by merging together acidic and basic as well 
as aromatic and 'other' clusters. An interesting fact is that in this case each of 
the four clusters has a relative frequency very close to 1/4. 

Despite efforts to balance bin sizes, the distributions of bin sizes were strongly 
skewed in favour of small sizes in all cases (Figure 6.11 shows one example) with 
many empty but also a few very large bins. Such distributions appear to follow 
the DGX distribution, a generalisation of Zipf-Mandelbrot law described by Bi, 
Faloutsos and Korn [21]. 




Figure 6.11: Distribution of SPEQ09 bin sizes (2,342,940 empty bins out of 10,077,696). 
[Plot not reproduced; horizontal axis: bin size.] 



6.4.2 General performance 

Figures 6.12, 6.13 and 6.14 present selected statistics of search experiments for 
fragment lengths 6, 9 and 12 respectively, consisting in each case of range queries 
retrieving 1, 10, 50, 100, 500 and 1000 nearest neighbours with respect to the 
BLOSUM62-based $\ell_1$-type quasi-metric. For each length, kNN searches were 
performed prior to range searches using the index that was expected to be the 
fastest in order to determine the search ranges for each random query fragment. 

6.4.3 Dependence on similarity measures 

While queries based on more than one similarity measure can be used on a single 
FSIndex, it is to be expected that similarity measures different from the one 
originally used to determine the partitions would have worse performance. To 
investigate the difference in performance for different BLOSUM matrices, range 
queries needed to retrieve 100 nearest neighbours of testing fragments of length 9 
were run using the index SPEQ09, which performed best for length 9 
in the previous experiment (Figure 6.13). In addition, searches were performed 
using the PSSMs (Section ITTI) constructed for each test fragment from the results 
of a BLOSUM62-based 100 NN search, in order to gain an insight into the actual 
search performance using a PSSM constructed from the results of a previous 
search, such as could be used to plan the biological experiments in Chapter 7. Table 
6.4 presents a summary of the results. 



Matrix      Bins (%)   Fragments (%)   Residues (%)   kNN Ratio
BLOSUM45    0.1004     0.1230          60.8850        1.5004
BLOSUM50    0.0978     0.1146          61.0993        1.4807
BLOSUM62    0.0957     0.1194          60.9394        1.4689
BLOSUM80    0.1038     0.1306          61.1321        1.4771
BLOSUM90    0.1111     0.1539          61.1010        1.4733
PSSM        0.0707     0.0869          58.1547        2.1805


Table 6.4: Performance of the FSIndex SPEQ09 with different similarity measures. The 
values shown are based on 100 NN queries of length 9. The columns denote the similarity 
measure (matrix), percentages of bins, fragments and residues (as before the percentage 
is out of the total number of residues in scanned fragments) scanned and the ratio between 
the number of bins retrieved for kNN and range searches. 

Figure 6.12: General performance of FSIndex for fragment dataset of length 6: (a) Median 
radius of a ball containing k nearest neighbours; (b) Total running time for 5000 
searches; (c) Mean number of bins scanned; (d) Mean number of fragments scanned; (e) 
Percentage of residues scanned (out of total number of residues in fragments scanned); (f) 
Mean ratio between the number of bins retrieved for kNN and range searches. 

Figure 6.13: General performance of FSIndex for fragment dataset of length 9: (a) Median 
radius of a ball containing k nearest neighbours; (b) Total running time for 5000 
searches; (c) Mean number of bins scanned; (d) Mean number of fragments scanned; (e) 
Percentage of residues scanned (out of total number of residues in fragments scanned); (f) 
Mean ratio between the number of bins retrieved for kNN and range searches. 

Figure 6.14: General performance of FSIndex for fragment dataset of length 12: (a) Median 
radius of a ball containing k nearest neighbours; (b) Total running time for 5000 
searches; (c) Mean number of bins scanned; (d) Mean number of fragments scanned; (e) 
Percentage of residues scanned (out of total number of residues in fragments scanned); (f) 
Mean ratio between the number of bins retrieved for kNN and range searches. 

6.4.4 Scalability 

Figure 6.15 shows the results of a set of experiments involving instances of FSIndex 
based on datasets of fragments of length 9 of different sizes (nr018K, nr036K, 
nr072K, SwissProt and nr288K). All indexes used the same alphabet partition 
(Table 6.3) and all queries were based on the BLOSUM62 $\ell_1$-type quasi-metric. 
Unlike Figures 6.12, 6.13 and 6.14, Figure 6.15 does not contain the 
total running time graph because the experiments were performed on different 
machines, but instead includes a plot showing the total number of residues scanned 
against the database size. This graph indicates the dependence of the performance 
of (an example of) FSIndex on dataset size, that is, its scalability. 

6.4.5 Access overhead 

Figure 6.16 summarises some of the results of Sections 6.4.2 and 6.4.4 by showing 
the average access overhead (Definition 5.7.4), that is, the average ratio between 
the number of fragments scanned and the number of true neighbours retrieved, 
for all combinations of indexes and fragment lengths available. The range search 
algorithm and the BLOSUM62-based $\ell_1$-type quasi-metric were used in all cases. 

6.4.6 Comparisons with other access methods 

The final set of experiments compares FSIndex with M-tree, mvp-tree and suffix 
arrays. In general, other methods take significantly more space and time compared 
with FSIndex and it was therefore necessary to restrict the comparisons to small 
datasets and queries retrieving fewer neighbours. 

M-tree 

Recall that M-tree is a paged metric access method that stores the majority of the 
structure in secondary memory, usually on hard disk. This is in contrast with the 
implementations of FSIndex, mvp-tree and suffix arrays used here, which store the 
whole index structure in primary memory. Hence, although M-tree occupies large 
amounts of space, most of the costs are associated with the secondary memory, 
which is much less expensive. On the other hand, I/O costs, not considered here, 
can be quite large. 

Figure 6.15: Performance of FSIndex for fragment datasets of length 9 of different sizes: 
(a) Median radius of a ball containing k nearest neighbours; (b) Scalability. Each line 
depicts a different number of nearest neighbours; (c) Mean number of bins scanned; 
(d) Mean number of fragments scanned; (e) Percentage of residues scanned (out of total 
number of residues in fragments scanned); (f) Mean ratio between the number of bins 
retrieved for kNN and range searches. 

Figure 6.16: Average access overhead of searches using FSIndex. 

The experiments described below were performed earlier than the other 
experiments presented in the present Chapter, using the resources from the High 
Performance Computing Laboratory (HPCVL), a consortium of several Canadian 
universities that the thesis author had the fortune to access during his visits to 
University of Ottawa. M-tree was not tested directly but as a part of the FMTree 
structure (Example 5.6.1) that allows use of metric indexing schemes for retrieval 
of quasi-metric queries. 

The FMTree structure consisted of an array of M-trees with additional data 
describing the score matrix and the distribution of self-similarities. FMTree was 
constructed by splitting the dataset into fibres and indexing each fibre separately using 
an instance of M-tree that was created using the BulkLoading algorithm of Ciaccia 
and Patella [38]. To perform a range search, the FMTree range search algorithm 
queries all M-trees associated with fibres as described in Example 5.6.1 and 
collects the hits to produce the answer to the query. The M-tree implementation 
was obtained from its authors' site: http://www-db.deis.unibo.it/Mtree/index.html 



Figure 6.17: Performance of FMTree based on M-tree on a dataset of fragments of length 
10. Average (median) and worst case results for 100 random queries are shown. Error bars 
show the interquartile range. 



The dataset in this experiment was the set of 1,753,832 unique fragments frag- 
ments of length 10 obtained from a 5000 protein sequence random sample taken 
from SwissProt (Release 41.21). An FMTree was generated for BLOSUM62 ii- 
type quasi-metric at a cost of 34,142,940 distance computations. Figure lOT] 
shows the results based on 100 random queries (unfortunately, mostly due to I/O 
costs, each search took over 1 minute and it was necessary to use a smaller number 
of runs). 
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Suffix arrays and mvp-tree 

Table 6.5 presents the results of comparisons between FSIndex (kNN and range
search algorithms), suffix array and mvp-tree over the datasets of fragments
of length 6 and 9 from nr018K. The similarity measure used was the associated
metric to the BLOSUM62 $\ell_1$-type quasi-metric because mvp-tree is a metric
access method and the performance of FSIndex does not differ much if a
quasi-metric is replaced by its associated metric. Had the mvp-tree shown good
performance on metric workloads, the next step would have been to split the
datasets into fibres to create an FMTree for quasi-metric searches.

Instances of suffix array were constructed using the routines published at |http : / /www . cs . dartrri 
The search algorithm was identical to Algorithm 6.3.4, where the input is a
single bin containing all fragments in the dataset. In order to construct an
instance of mvp-tree, duplicate fragments in the datasets were collected
together and the sets of unique fragments provided to the mvp-tree
construction algorithm. The mvp-tree implementation, developed by the original
authors of mvp-tree [25], was kindly provided by Marco Patella and modified
for use with protein fragments by the thesis author. The maximum size of a
leaf node was set to 5.



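| Length | Neighbours | FSIndex (kNN) | FSIndex (range) | Suffix array | mvp-tree  |
|--------|------------|---------------|-----------------|--------------|-----------|
| 6      | 1          | 15.0          | 9.9             | 20130.7      | 7598.5    |
| 6      | 10         | 12.1          | 7.1             | 3761.1       | 6229.5    |
| 9      | 1          | 1869.7        | 1303.6          | 72351.1      | 1016181.1 |
| 9      | 10         | 902.6         | 615.4           | 14827.2      | 214032.5  |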

Table 6.5: Comparison of performance of FSIndex, suffix array and mvp-tree. The
table shows the values of the effective access overhead, that is, the number of characters
(residues) accessed in order to retrieve a given number of nearest neighbours, normalised
by the fragment length and the number of retrieved neighbours. The statistics are in terms
of characters rather than data points because the suffix array search algorithm passes by
each point but only computes the distances if necessary.
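
As a small worked illustration of the normalisation described in the caption,
the sketch below computes the effective access overhead from a raw character
count. The function name is hypothetical (it is not taken from the thesis
code), and the raw count of 90 characters is simply back-calculated from the
table entry 15.0 for length 6 and one neighbour; it is not a reported
measurement.

```python
def effective_access_overhead(chars_accessed, fragment_length, neighbours):
    """Characters accessed per character of retrieved output.

    Hypothetical helper: divides the number of residues accessed by
    (fragment length x number of neighbours retrieved).
    """
    if fragment_length <= 0 or neighbours <= 0:
        raise ValueError("fragment_length and neighbours must be positive")
    return chars_accessed / (fragment_length * neighbours)


# An overhead of 15.0 at length 6 with 1 neighbour corresponds to
# 15.0 * 6 * 1 = 90 characters accessed (illustrative back-calculation).
overhead = effective_access_overhead(90, 6, 1)
print(overhead)
```

An overhead of 1.0 would mean that only the characters of the retrieved
neighbours themselves were accessed.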
6.5 Discussion

While the experiments presented in Section 6.4 covered very few datasets and
a small proportion of possible parameters for FSIndex creation, it can still be 
observed that FSIndex performed well. Not only did it perform much better than 
the other indexing schemes tested but it has proven itself to be very usable in 
practice: it does not take too much space (5 bytes per residue in the original 
sequence dataset plus a fixed overhead of the bin array), considerably accelerates 
common similarity queries and the same index can be used for multiple similarity 
measures without significant loss of performance. The remainder of the current 
section will examine some salient features of the experimental results. 

6.5.1 Power laws and dimensionality 

The most striking feature of Figures 6.12, 6.13 and 6.14 is the apparent
power-law dependence of the total running time, the number of bins scanned and
the number of fragments scanned on the number of actual neighbours retrieved,
manifesting as straight lines on the corresponding graphs on log-log scale.
For each index, the slopes of the three graphs (i.e. running time, bins
scanned and fragments scanned) are very close, implying that the same power
law governs the dependence of all three variables on the number of neighbours
retrieved. The exponents are 0.81 for length 6, between 0.57 and 0.63 for
length 9, and about 0.45 for length 12. While a rigorous theory, especially in
the context of quasi-metrics, is still missing, it is possible to offer an
intuitive explanation for this phenomenon.

Clearly, the graphs in question show the average growth of a ball in the
projection $\pi(\Sigma^m)$ against the growth of a ball of the same radius in
the original space $\Sigma^m$. Denote by $k$ the number of true neighbours
retrieved and by $V(k)$ the corresponding number of fragments scanned. The
power relationship can then be written as $V(k) = O(k^{D_1})$. If we accept
the reasoning behind the distance exponent (not obvious from the data and not
justified except for very small radii - see Appendix A), that is, that
$k = O(r^{D_2})$ where $D_2$ is the 'dimension' of the space, it follows that
$V(r) = O(r^{D_1 D_2})$. Using the same reasoning about the size of the ball
in the projection (but note that the distance in the projection need not
satisfy the triangle inequality), we conclude that the 'dimension' of the
projection is $D_1 D_2$, that is, the original dimension $D_2$ is reduced by a
factor $D_1$. Assuming that the values of the distance exponent do not depend
on whether a quasi-metric or its associated metric is used and taking the
values of the distance exponent estimated in Subsection 6.1.6, the 'dimension'
of the projected space is close to 6.5 for both length 6 and length 9.
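
The way such an exponent can be read off a log-log plot may be sketched as
follows. This is only an illustration under stated assumptions: the data are
synthetic (generated from an exact power law with exponent 0.81, not taken
from the experiments), the function names are hypothetical, and the value
$D_2 = 8.0$ used at the end is an illustrative placeholder rather than an
estimate reported in the thesis.

```python
import numpy as np


def fit_power_law(k, v):
    """Fit v = c * k**d by least squares on log-log scale; return (d, c)."""
    k = np.asarray(k, dtype=float)
    v = np.asarray(v, dtype=float)
    # The slope of the straight line log v = d * log k + log c is the exponent.
    slope, intercept = np.polyfit(np.log(k), np.log(v), 1)
    return slope, np.exp(intercept)


def projected_dimension(d1, d2):
    """'Dimension' of the projection under V(k) = O(k^D1) and k = O(r^D2)."""
    return d1 * d2


# Synthetic data: fragments scanned V(k) for k neighbours retrieved.
k = np.array([1, 10, 100, 1000, 10000])
v = 50.0 * k ** 0.81

d1, c = fit_power_law(k, v)
print(round(d1, 3), round(c, 3))

# With a hypothetical D2 = 8.0 the projected 'dimension' is roughly 6.5.
print(projected_dimension(d1, 8.0))
```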

6.5.2 Effect of subindexing of bins 

PATRICIA-like subindexing of bins was introduced in order to accelerate
scanning of bins containing many duplicate or highly similar fragments.
Figures 6.12, 6.13, 6.14 and 6.15 (Subfigure (e) in each case) show that there
are two main factors influencing the proportion of residues scanned out of the
total number of residues in the fragments belonging to the bins that need to
be scanned: the (average) size of bins and the number of alphabet partitions
at starting positions. Instances of FSIndex having many partitions at the
first few positions perform well (SPEQ06, SPNA09), those that have few
partitions with many letters per partition, less so.

Clearly, if a bin has a single-letter partition at its first position, the
distance at that position need only be retrieved once, at the start of the
scan, independently of the number of fragments the bin contains. The effects
for the second and subsequent positions are less prominent, if only for the
reason that using many partitions would result in many bins being empty. The
actual composition of the dataset is also important, as Figure 6.15 (e)
attests: although the same partitions are used and nr0288K is almost twice as
large, SPEQ09 scans fewer characters. The possible reason lies in the nature
of SwissProt, which, as a human curated database, is biased towards the
well-researched sequences which are more related among themselves while not
necessarily being representative of the set of all known proteins. On the
other hand, nr0288K is a random sample from the nr database, which is exactly
the non-redundant set of all known proteins.

The actual proportion varies from 30% (SPEQ06, length 6) to over 85%
(nr018K, length 9). The percentage of characters scanned grows slowly with
the number of neighbours retrieved - most probably this is because the number
of bins accessed also grows, requiring that at least one full sequence is
scanned.

To summarise, subindexing of bins does produce some savings, the exact amount
depending on the dataset and alphabet partitioning. However, and this is
further attested by the poor performance of the pure suffix array compared to
FSIndex (Table 6.5), the good performance of FSIndex is mostly due to alphabet
partitioning.

6.5.3 Effect of similarity measures 



Table 6.4 indicates very little difference in performance of the same instance of
FSIndex with respect to different similarity measures. This should not be a sur- 
prise because the BLOSUM matrices are indeed very similar, modelling the same 
phenomenon in slightly different ways but generally retaining the same groupings 
of amino acids. The PSSM-based searches also performed well, mainly because 
the PSSMs are usually constructed out of sets of sequences that are strongly con- 
served at least in one or two positions, and hence, in those positions, the 'dis- 
tances' to all other clusters are so large that many branches of the implicit search 
tree can be pruned. 

6.5.4 Scalability 

Figure 6.15 (b) indicates that FSIndex is scalable with respect to the number
of nearest neighbours retrieved - the number of residues that need to be
scanned grows sublinearly with dataset size (in fact, the exponent is 0.25 to
0.3). The exponent for the growth of the number of scanned points (graphs not
shown in any figure) is about 0.4, indicating that using the PATRICIA-like
structure improves scalability. The principal reason for sublinear growth of
the number of items that need to be scanned is definitely that the search
radius decreases with dataset size (Figure 6.15 (a)). Unfortunately, the
results in terms of search radius are not available and it is not possible to
examine the scalability with respect to a fixed radius, although theoretical
considerations imply that the growth would be linear. However, it may be that
subindexing of bins would bring an appreciable sublinear behaviour in this
case as well.

6.5.5 Comparison with other indexing schemes 

Results of Subsection 6.4.6 indicate that FSIndex decisively outperforms all
other indexing schemes considered. M-tree performed the worst, needing to scan
1.3 million fragments of length 10 in order to retrieve the nearest neighbour.
The performance of mvp-tree is not much better, taking into account the
dimensionality: it requires scanning about 1 million fragments of length 9 to
retrieve the nearest neighbour. Suffix array generally performed better than
mvp-tree, except for retrieving the nearest neighbour of length 6.

In the case of suffix arrays, it is clear that the large alphabet and the
relatively small dataset (Figure 6.1) are responsible for the relatively poor
performance. Also note that suffix trees (and hence suffix arrays) generally
are not good approximations of the geometry with respect to $\ell_1$-type
distances - two fragments lacking a common prefix may have a small distance.
It should be noted that the performance of the suffix array based scheme
appears to improve with fragment length compared to FSIndex.

The poor performance of M-tree and mvp-tree is somewhat surprising because
Mao, Xu, Singh and Miranker [131] have recently proposed using exactly M-tree
for fragment similarity searches. However, on closer inspection, several
differences appear. First, Mao, Xu, Singh and Miranker use a different metric.
More importantly, they use a significantly improved M-tree creation algorithm.
Finally, if their results are compared with those from Figure 6.17 (this can
be done at least approximately because the same fragment length was used and
the size of the yeast proteome dataset used in [131] was very close to the
size of the SwissProt sample used in our experiment), it appears that there is
no more than a 10-fold improvement. While this is quite significant, the total
performance still appears worse than that of FSIndex. For more detailed
comparisons it would be necessary to obtain the code of the improved M-tree
from [131] and run a full suite of comparison experiments.



Chapter 7 

Biological Applications 



The present chapter introduces the prototype of the PFMFind method for identi- 
fying potential short motifs within protein sequences. PFMFind uses the FSindex 
access method to query datasets of protein fragments. 

7.1 Introduction 

Most of the widely used sequence-based techniques for protein motif detection
depend on regular expressions (deterministic patterns) [176, 26], profiles
(PSSMs) [78, 6] or profile hidden Markov models [116, 53]. As outlined in
Chapter 3, a PSSM is constructed by taking a set of protein fragments,¹
constructing a multiple alignment, estimating the positional distributions of
amino acids and producing positional log-odds scores for each amino acid. A
PSSM can then be used to search a sequence dataset in order to identify new
sequences fitting the profile (that is, its underlying positional
distribution). This procedure can be performed iteratively, using sequences
retrieved in one iteration to construct a profile for the subsequent one.
Profile hidden Markov models generalise profiles by also modelling the
distributions of gaps found in the multiple alignments (see Chapter 5 of the
book by Durbin et al. [52]).

¹Fragments are usually used rather than full sequences because the motifs are
associated with domains, which are by their nature local.
The initial set of sequences consists of known examples of the motif in
question. It can be obtained from results of laboratory investigations, from
alignments of structures (for example using the SCOP database [7]) or from
results of sequence similarity searches. PSI-BLAST [6] uses the last approach:
it searches a protein dataset using a score matrix such as BLOSUM62 and uses
the results to construct a multiple alignment and produce a profile for the
second iteration. Subsequent searches are based on profiles constructed from
the results retrieved in the preceding iteration. Variations to this basic
approach are possible, mostly involving the choice of dataset and weights of
sequences used for profile construction [167]. The performance of any
particular technique is measured by its ability to retrieve relevant items
from the database (sensitivity) and to retrieve only such items (selectivity).

The focus of the present investigation is short protein fragments of lengths
7-15, with the aim of developing new bioinformatic tools for discovery of
relationships between protein fragments that cannot necessarily be found when
considering longer fragments. Such relationships need not imply a common
ancestor but could have arisen from convergence. The motifs discovered should
correspond to a conserved function and should give an insight into a possible
origin of such a function.

Watt and Doyle [204] recently observed that BLAST is not suitable for
identifying shorter sequences with particular constraints and proposed a
pattern search tool to find DNA or protein fragments matching exactly a given
sequence or a pattern.² I propose here an alternative technique, named
PFMFind (PFM stands for Protein Fragment Motif), that involves the use of full
similarity search with almost arbitrary scoring schemes and iterated searches
closely resembling PSI-BLAST. It differs from PSI-BLAST in that it uses a
global ungapped similarity measure over the fragments of fixed length
(referred to as an $\ell_1$-type sum in Chapter 3), allowing use of FSindex as
a subroutine. The similarity score being ungapped could affect sensitivity but
one should note that gapped alignments of short fragments, at least of lengths
not greater than 10, are often statistically insignificant if the usual gap
penalties are used (for example, BLAST uses 11 as gap opening penalty, which
is larger than the cost of any single substitution - in fact two to three
conservative substitutions can usually be had for that cost, depending on the
exact score matrix). It is also possible to examine several fragment lengths,
thus compensating for the similarity being global rather than local. Of
particular biological interest are cases where certain relationships can be
found at a particular fragment length and not the others, indicating a
strongly conserved short motif that cannot be extended to a longer one.

²A "pattern" in the sense of Watt and Doyle is a group of "target sequences",
which are essentially regular expressions.

The present chapter contains the description of the current PFMFind algorithm
together with six case studies based on SwissProt [23] query sequences. The
query sequences (SwissProt accessions in brackets) are: prion protein 1
precursor (PrP) (P10279), β-casein precursor (P02666), κ-casein precursor
(P02668), β-lactoglobulin precursor (P02754), cytochrome P450 11A1
mitochondrial precursor (cholesterol side-chain cleavage enzyme) (P00189), and
sensor-type histidine kinase prrB (Q10560). The first five sequences are
bovine (Bos taurus) while the histidine kinase is from Mycobacterium
tuberculosis.

The PrP protein is found in high quantity in the brain of humans and animals 
infected with transmissible spongiform encephalopathies (TSEs). These are de- 
generative neurological diseases such as kuru, Creutzfeldt- Jakob disease (CJD), 
Gerstmann-Straussler syndrome (GSS), scrapie, bovine spongiform encephalopa- 
thy (BSE) and transmissible mink encephalopathy (TME) ll2T9l l220l [T59l lIOTll 
that are caused by an infectious agent designated prion. While many aspects of 
the role of PrP in susceptibility to prions are known, its physiological role and the 
pathological mechanisms of neurodegeneration in prion diseases are still elusive 

Caseins are major mammalian milk proteins involved in determination of the
surface properties of the casein micelle which contain calcium and have a
major role in mammalian neonate nutrition [137]. Bovine milk contains four
different types of casein: α-S1-, α-S2-, β- and κ-. Caseins are expressed in
mammary glands, secreted with milk and following digestion may give rise to
bioactive peptides.

β-Lactoglobulin is another major component of milk. It is the primary
component of whey, binds retinol and, unlike the caseins, has a well-defined
conformation [120] containing an eight-stranded continuous β-barrel and one
major α-helix.

Cytochromes P450 are a superfamily of heme-containing enzymes involved in
metabolism of drugs, foreign chemicals, arachidonic acid, eicosanoids, and
cholesterol, synthesis of bile acids, steroids and vitamin D3, retinoic acid
hydroxylation and many still unidentified cellular processes [145]. Cytochrome
P450 11A1 is a mitochondrial enzyme coded by the CYP11A1 gene and catalyses
the cholesterol side-chain cleavage reaction [98].

Histidine kinases phosphorylate their substrates on histidine residues and
have been well-characterised in bacteria, yeast and plants [215], with a
variety of functions including chemotaxis and quorum sensing in bacteria and
hormone-dependent developmental processes in eukaryotes. They are also present
in mammals [19]. Typically, histidine protein kinases are transmembrane
receptors with an amino-terminal extracellular sensing domain and a
carboxy-terminal cytosolic signaling domain and do not show significant
similarity to serine/threonine or tyrosine protein kinases although they might
be distantly related [115].

The query sequences were chosen mainly according to the interests of the
author and his supervisors. For example, caseins have no known function apart
from nutrition while being strongly conserved in mammals, leading to questions
about their origins. Cytochromes P450 form a large and well-researched
superfamily with many examples in SwissProt and TrEMBL, thus being
particularly suitable for the PFMFind approach. Histidine kinases are a subset
of the class of protein kinases while being very distantly related to the
remainder of the class. PrPs are involved in a well-publicised set of
neurological diseases and have a relatively unusual structure of
aromatic-glycine tandem repeats [68].
7.2 Methods

7.2.1 General overview 

PFMFind takes a full sequence of interest and divides it into all overlapping frag- 
ments of a given fixed length. For each fragment, it uses FSindex-based range 
search to find the set of statistically significant neighbours from a protein fragment 
dataset with respect to a general similarity scoring matrix such as BLOSUM62. 
All fragments that have fewer significant neighbours than a given threshold are ex- 
cluded from further iterations. For each fragment where the number of significant 
results is sufficiently large, it constructs a PSSM from the results and proceeds 
with the next iteration. The procedure is repeated several times, each time using 
the results of one iteration, if their number is over the threshold, to construct the 
profile for the next search. 

As in PSI-BLAST, the measure of statistical significance is the E-value, the
expected number of fragments similar to a given query fragment under the
assumption that amino acids in a protein fragment are independently and
identically distributed. Subsection 7.2.3 below describes the derivation and
computation of the distribution of similarity scores with respect to a given
query fragment and similarity measure. The E-value threshold decreases with
iterations. This is because preliminary investigations have shown that too few
results of the initial, general score matrix-based search are significant
under the model from Subsection 7.2.3 at a level usually set in bioinformatics
applications of a similar kind (for example, in PSI-BLAST, the inclusion
threshold E-value is 0.005), while the hits having E-value up to 1.0 clearly
belonged to the same protein (in a different species) as the query protein. In
the iterations using profiles, more stringent significance levels have led to
expected results.
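
The overall control flow described above can be sketched as follows. This is a
minimal illustration, not the PFMFind code: the function names, the `search`
and `build_profile` callables and the `max_iterations` parameter are
hypothetical placeholders, while the E-value schedule and the threshold of 30
hits are the values reported in Subsection 7.2.5.

```python
def overlapping_fragments(sequence, length):
    """All overlapping fragments of a fixed length with 0-based offsets."""
    return [(start, sequence[start:start + length])
            for start in range(len(sequence) - length + 1)]


def evalue_cutoff(iteration):
    """E-value cutoff for a 1-based iteration (schedule of Subsection 7.2.5)."""
    if iteration <= 2:
        return 1.0
    if iteration <= 4:
        return 0.1
    return 0.01


def run_fragment(fragment, first_scorer, search, build_profile,
                 min_hits=30, max_iterations=5):
    """Iterate searches for one fragment; return hits of the last kept round.

    `search(fragment, scorer, cutoff)` and `build_profile(hits)` are
    hypothetical callables standing in for the range search and the PSSM
    construction respectively.
    """
    scorer, hits = first_scorer, []
    for iteration in range(1, max_iterations + 1):
        new_hits = search(fragment, scorer, evalue_cutoff(iteration))
        if len(new_hits) < min_hits:
            break  # too few significant neighbours: stop iterating
        hits = new_hits
        scorer = build_profile(hits)  # profile for the next search
    return hits


print(overlapping_fragments("MKVLILAC", 6))
```

A fragment whose first search yields fewer than `min_hits` results returns an
empty list, which corresponds to its exclusion from further iterations.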

7.2.2 PSSM construction 

Since the fragment length is fixed, a collection of fragments directly
corresponds to an ungapped multiple alignment. Therefore, the first nontrivial
step is assigning a weight to each sequence in order to compensate for the
possible bias of the set of hits caused by over- and under-representation of a
particular sequence. While each sequence is assigned a new weight, the total
weight of the fragment set remains the original number of hits. The current
version of PFMFind uses the weighting scheme proposed by Henikoff and Henikoff
[90], which gives smaller weight to well-represented sequences and is
computationally simple. The second step involves obtaining the 'observed'
(given the weights) frequencies of amino acids at each position and combining
them with mixtures of Dirichlet priors in a way described by Sjolander and
others [174] (see also Chapter 5 of [52]). The contribution of Dirichlet
priors decreases with sample size, preventing overfitting the profile to a
small sample while leaving the distribution derived from a large set
essentially unchanged. Finally, the procedure calculates log-odds similarity
scores to be used for searches. The scores are multiplied by two (that is,
scaled to half-bit units) and converted to integers, enabling direct
comparison with the BLOSUM62 scores which are also in half-bit units.
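
The three steps above can be sketched as follows, under two loudly stated
assumptions. First, the weighting is assumed to be the position-based scheme
of Henikoff and Henikoff (each column contributes 1/(r s) to a sequence, where
r is the number of distinct residues in the column and s the number of
sequences sharing that residue). Second, the mixture of Dirichlet priors is
replaced here by a single background-proportional pseudocount, which is a much
simpler stand-in and not the method of [174]. All names are hypothetical.

```python
import math
from collections import Counter


def position_based_weights(fragments):
    """Position-based sequence weights, rescaled to sum to len(fragments)."""
    n, m = len(fragments), len(fragments[0])
    weights = [0.0] * n
    for pos in range(m):
        counts = Counter(f[pos] for f in fragments)
        distinct = len(counts)
        for idx, f in enumerate(fragments):
            weights[idx] += 1.0 / (distinct * counts[f[pos]])
    total = sum(weights)
    return [w * n / total for w in weights]


def half_bit_pssm(fragments, background, pseudocount=1.0):
    """Integer log-odds scores in half-bit units, one dict per position.

    The pseudocount is a simplified stand-in for Dirichlet mixture priors:
    its influence shrinks as the number of fragments grows.
    """
    weights = position_based_weights(fragments)
    n, m = len(fragments), len(fragments[0])
    pssm = []
    for pos in range(m):
        observed = Counter()
        for f, w in zip(fragments, weights):
            observed[f[pos]] += w
        column = {}
        for aa, p in background.items():
            q = (observed[aa] + pseudocount * p) / (n + pseudocount)
            column[aa] = int(round(2.0 * math.log2(q / p)))
        pssm.append(column)
    return pssm


# Toy two-letter alphabet, for illustration only.
hits = ["AA", "AA", "AC"]
w = position_based_weights(hits)
pssm = half_bit_pssm(hits, {"A": 0.5, "C": 0.5})
print(w, pssm)
```

In the toy example the third fragment, which differs from the other two,
receives the largest weight, while the weights still sum to the number of
hits.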

7.2.3 Statistical significance of search results 

To evaluate the statistical significance of a particular similarity score and
therefore of an alignment associated with it, we estimate how probable that
score is given a null, or background, hypothesis. In this case, we assume as a
null hypothesis that fragments are generated by an independent, identically
distributed process where the probability of each amino acid is given by its
relative frequency in the dataset (Subsection 6.1.3 discusses this and an
alternative model of protein sequences). Let $m$ be the fragment length. For
each $i = 0, 1, \ldots, m-1$, let $S_i : \Sigma \to \mathbb{R}$ be the score
function at position $i$. If the similarity measure is given by a score matrix
$s : \Sigma \times \Sigma \to \mathbb{R}$, we have $S_i(a) = s(\omega_i, a)$
where $\omega = \omega_0 \omega_1 \ldots \omega_{m-1}$ is the query fragment
and $a \in \Sigma$, while in the case of a PSSM $S_i$ is the score function at
its $i$-th position.

By our assumptions, it is clear that $\{S_i\}_{i=0}^{m-1}$ is a collection of
independent random variables and that the similarity score $S$ of a fragment
$x$ is given by the sum of the values $S_i(x_i)$ over all $i$. Hence, the
density of $S$, denoted by $f_S$, is given by the convolution of the densities
$f_{S_i}$ of the random variables $S_i$, that is

$$f_S = f_{S_0} * f_{S_1} * \cdots * f_{S_{m-1}}.$$

By the well-known Convolution Theorem, the Fourier transform of the
convolution of a collection of functions is the product of their Fourier
transforms. Since the functions in question are discrete, the efficient way of
computing $f_S$ is to compute the discrete Fourier transforms of $f_{S_i}$ for
each $i$, multiply them together and take the inverse discrete Fourier
transform of the product, all using the FFT (Fast Fourier Transform) algorithm
(the book by Smith [175] provides a good reference about signals, convolutions
and Fourier transforms and is freely available on the web).

Once the density of similarity scores is obtained, it is straightforward to
compute the p-value of each score $T$, that is, the probability that a random
score $X$ is greater than or equal to $T$. The number of fragments in the
dataset expected by chance to have a score equal to or exceeding $T$, also
known as the E-value, is obtained by multiplying the p-value by the size of
the dataset. The relationships represented by the search hits where the
E-value of the similarity score is very low (usually $\ll 1$) are considered
unlikely to have arisen by chance and therefore statistically significant. The
significance cutoff can be computed prior to search so that search by E-value
reduces to range search.
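
The computation described in this subsection can be sketched with NumPy's FFT
routines. This is an illustration under stated assumptions, not the PFMFind
implementation (which, according to Subsection 7.2.4, used the Numeric
package): scores are assumed to be integers, each position is given as a
dictionary from residue to score, and all function names are hypothetical.

```python
import numpy as np


def score_density(position_scores, background):
    """Density of S = sum_i S_i via FFT; returns (lowest score, density).

    density[j] is the probability of the total score lowest + j.
    """
    residues = list(background)
    lo = min(s[a] for s in position_scores for a in residues)
    hi = max(s[a] for s in position_scores for a in residues)
    m = len(position_scores)
    size = m * (hi - lo) + 1  # support of the m-fold convolution
    spectrum = np.ones(size // 2 + 1, dtype=complex)
    for s in position_scores:
        f_i = np.zeros(size)  # density of S_i, zero-padded to full length
        for a in residues:
            f_i[s[a] - lo] += background[a]
        spectrum *= np.fft.rfft(f_i)  # product of the transforms
    density = np.clip(np.fft.irfft(spectrum, n=size), 0.0, None)
    return m * lo, density


def e_value(threshold, lowest, density, dataset_size):
    """Expected number of fragments scoring >= threshold by chance."""
    start = max(threshold - lowest, 0)
    return float(density[start:].sum()) * dataset_size


def score_cutoff(max_e_value, lowest, density, dataset_size):
    """Smallest score whose E-value does not exceed max_e_value, or None."""
    tail = np.cumsum(density[::-1])[::-1]  # tail[j] = P(S >= lowest + j)
    passing = np.nonzero(tail * dataset_size <= max_e_value)[0]
    return lowest + int(passing[0]) if passing.size else None


# Toy example: two positions, two equiprobable residues, scores +1 and -1.
background = {"A": 0.5, "B": 0.5}
positions = [{"A": 1, "B": -1}, {"A": 1, "B": -1}]
lowest, density = score_density(positions, background)
print(lowest, density)
print(e_value(2, lowest, density, 100))
print(score_cutoff(30, lowest, density, 100))
```

Zero-padding every positional density to the length of the full support keeps
the circular convolution computed by the FFT equal to the ordinary one. The
last function reflects the remark above: once the score cutoff for a given
E-value is known, the search itself is an ordinary range search.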

7.2.4 Implementation 

PFMFind is implemented in the Python programming language [195], accessing
the FSindex library, which is written in the C programming language [109],
through the SWIG [|TT| interface. The PFMFind code uses the routines from the
Python standard library [128] as well as from the Biopython [186], Numeric flU
and Transcendental [46] packages.

Architecturally, the PFMFind system consists of a master server, several slave
servers and at least one client, all communicating through TCP/IP sockets. The
master server handles computation of searches and statistical significance by
distributing the load to slave servers while the client is responsible for
storage of results and computation of profiles.³ Python programs making use
of PFMFind create an instance of a client, connect to a master server and
provide the parameters of desired searches. A graphical user interface, called
FragToolbox, was written using the Tkinter module [TTTI from the Python
standard library in order to facilitate the analysis of the results by
displaying them in a human-usable format.

The above configuration is necessary in order to use large datasets which
cannot fit into the memory of a single machine. It also opens the possibility
of parallelisation of most of the computation, leaving only storage and
display to clients.
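
The scatter-gather pattern behind this configuration can be sketched as
follows. This is a deliberately simplified illustration: threads within one
process stand in for slave servers reached over TCP/IP sockets, and the class
and function names are hypothetical rather than taken from the PFMFind code.

```python
from concurrent.futures import ThreadPoolExecutor


class SlaveStub:
    """Stand-in for a slave server holding one part of the dataset."""

    def __init__(self, fragments):
        self.fragments = fragments

    def range_search(self, score, cutoff):
        """Return (fragment, score) pairs scoring at least `cutoff`."""
        scored = ((f, score(f)) for f in self.fragments)
        return [(f, s) for f, s in scored if s >= cutoff]


def master_search(slaves, score, cutoff):
    """Send the same query to every slave and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(slaves)) as pool:
        futures = [pool.submit(s.range_search, score, cutoff) for s in slaves]
        hits = [hit for future in futures for hit in future.result()]
    # Best-scoring hits first, as a client would typically display them.
    return sorted(hits, key=lambda hit: -hit[1])


# Toy dataset split into two parts; the 'score' counts the letter A.
slaves = [SlaveStub(["AAA", "AAB"]), SlaveStub(["ABB", "BBB"])]
result = master_search(slaves, lambda f: f.count("A"), cutoff=2)
print(result)
```

Because each part of the dataset is searched independently, the merged result
is the same as that of a single search over the whole dataset.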

7.2.5 Experimental parameters 

Dataset 

Preliminary investigations using SwissProt as the database have shown that in
most cases too few sequences are available in order to be able to construct
good profiles even if the initial E-value is relaxed. While SwissProt is
manually annotated and therefore provides most confidence in functional
annotation, it is also biased in favour of well-researched sequences. I
therefore decided to use the full Uniprot [10] dataset consisting of SwissProt
together with TrEMBL (translated EMBL DNA sequence dataset). Since the size of
Uniprot is large (Release 3.5 that was used together with alternative splicing
forms of some proteins had 556,628,177 amino acid residues in 1,737,387
sequences), it was necessary to divide it into 12 SwissProt-sized parts and to
run a PFMFind slave server for each part on a different machine.

Search and profile construction parameters 

The cutoff E-values were 1.0 for the first and second, 0.1 for the third and
fourth and 0.01 for all subsequent iterations. As preliminary investigations
indicated that at E-value thresholds of 1.0 or smaller most BLOSUM matrices
produce similar results, my choice was to use BLOSUM62 in the first iteration.
The profile construction algorithm used the Dirichlet mixture recode3.20comp
downloaded from the web site
http://www.cse.ucsc.edu/research/compbio/dirichlets/ of some of the authors of
[174]. They recommend the recode3.20comp mixture as the best to be used with
close homologs. After several trials I set the number of hits necessary to
proceed with the next iteration to 30 as a compromise between the need to have
as large a number of hits as possible in order to have a good profile and the
average number of neighbours given the required statistical significance.

³It is planned to move the profile construction to the server side as well,
leaving only the storage and interface to the client.



7.3 Results 

The full PFMFind algorithm was run for the six test sequences. Fragment lengths 
8 to 15 were considered for all test proteins except PrP where only fragments of 
length 8 were considered because of technical limitations: too many hits were 
encountered and the available memory was insufficient to store all but the length 
8 results (there were usually more than 100 hits for each overlapping fragment, 
sometimes over 1000 hits). The hits were almost exclusively exact matches to 
fragments of the query sequence or other prion proteins, in the same or different 
species. PrP is glycine rich and contains several repeats which manifested as 
several hits to the same protein in a single fragment search. 

The running time for searches for all the examples was in the order of one 
to two hours, using 12 Intel® Pentium® IV 2.8 GHz machines running in paral- 
lel, with indices optimised for lengths 10 and 12. Running FSindex did not take 
more than half of that time, the remainder being taken by calculation of statistical 
significance, construction of profiles, communication between machines and I/O 
operations. 

Table 7.1 provides the summary of the results for all examples except PrP.
The 'Region' column denotes the region of the original query sequence where
significant hits to database proteins were found and usually refers to the
maximal extent of such region for the longest fragment length where hits were
found. The 'Feature' column contains the annotations of the region in question
taken from SwissProt and InterPro [141], a database of protein families,
domains and functional sites consisting of several member databases using a
variety of motif-finding techniques. The last column includes the description
of the major categories of proteins found in the hits. Some of the κ-casein
hits are not included because they were difficult to characterise (no
SwissProt entry present).



β-casein precursor [Bos taurus] (P02666)

| Region           | Lengths   | Feature                    | Major classes of hits |
|------------------|-----------|----------------------------|-----------------------|
| 1-18             | 8-15      | signal peptide             | α-S1-, α-S2-, β-, γ-, ε-casein, amelogenin (only 4-18) (all hits to signal peptide region) |
| 3-15             | 11        | signal peptide (potential) | vitellogenin (signal peptide) |
| 3-17             | 12-13, 15 | transmembrane (potential)  | cation-, heavy metal-transporting ATPase |
| 3-14             | 11-12     |                            | cytochrome b |
| 158-173, 182-200 | 12-15     |                            | proline, glutamine and alanine rich fragments from various proteins, repeats |


κ-casein precursor [Bos taurus] (P02668)

| Region  | Lengths | Feature            | Major classes of hits |
|---------|---------|--------------------|-----------------------|
| 30-191  | 8-15    | full mature protein | κ-casein |
| 110-133 | 13-15   |                    | histidine rich fragments from various proteins |
| 139-166 | 13-15   |                    | threonine rich fragments from various proteins |
| 32-46   | 14-15   |                    | self-incompatibility ribonucleases |
| 31-45   | 15      |                    | myosin |
| 174-188 | 15      |                    | Kluyveromyces lactis strain NRRL Y-1140 chromosome E (apparently a repeat) |
| 80-95   | 12-15   | part of casoxin B  | bacterial aldehyde dehydrogenase |
| 55-67   | 13-14   | includes casoxin A | Erythrocyte membrane protein (Plasmodium falciparum) |
| 51-63   | 13      | includes casoxin A | 'extracellular' region of bacterial regulatory protein blaR1 |
| 155-167 | 13      |                    | bacterial sulfate adenylyltransferase |

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/3-lactoglobulin precursor [Bos taurus] (P02754) 



Region 


Lengths 


Feature 


Major classes of hits 


25-39 


12-15 


turn, helix, strand 


/3-lactoglobulin, outer membrane lipoproteins, plasma retinol- 
binding protein, glycodehn, recA, SbnH (length 12 only) 


54-68 


14-15 


turn, strand, turn 


/3-lactoglobuUn, glycodelin 


58-72 


14-15 


strand, turn, strand 
(part) 


glucose- 1 -phosphate thymidylyltransferases, /3-lactoglobulin 


110-124 


14 


strand 


/3-lactoglobuHn, glycodehn, bacterial DNA methylase 



Cytochrome P450 11A1 mitochondrial precursor [Bos taurus] (P00189)

Region | Lengths | Feature | Major classes of hits
-------|---------|---------|----------------------
77-86 | 9-10 | turns | cytochrome P450 11A1, formyltetrahydrofolate synthetase
85-99 | 12, 15 | turn, helix, turn, helix | various cytochromes P450
119-135 | 13-15 | contains a turn | cytochrome P450 (11A1 and 11B2), serine/threonine-protein kinases Pim-2 and Pim-3 (kinase domain, length 14), transposase (lengths 13-14), various other proteins
260-273 | 12-14 | helix | cytochromes P450 (mostly 11A1 and 11B2)
311-343 | 11, 13-15 | helix, turn, helix | various cytochromes P450 (few hits at length 14)
343-356 | 14 | helix | cytochrome P450 11A1
370-396 | 9-15 | turn, helix, strand | various cytochromes P450
398-442 | 9-15 | strand, turn, strand, turn, strand, helix, turn, turn | various cytochromes P450 (Note: only few fragments in this region have hits at shorter lengths)
448-483 | 9-15 | turn, turn, helix, turn, turn; heme binding site | various cytochromes P450



Sensor-type histidine kinase prrB [Mycobacterium tuberculosis] (Q10560)

Region | Lengths | Feature | Major classes of hits
-------|---------|---------|----------------------
230-257 | 9-15 | histidine kinase domain, contains phosphohistidine | various histidine kinases, sensory proteins, ethylene receptor
373-398 | 11-15 | histidine kinase domain | various histidine kinases, DNA topoisomerase, gyrase, other proteins
400-425 | 10-15 | histidine kinase domain | various histidine kinases, ethylene receptor (cysteine synthase and tripeptide permease appear in hits for one fragment of lengths 10-11 in this region)



Table 7.1: Significant hits to query fragments.

7.4 Discussion

Two kinds of hits can be observed in general: hits to the query protein itself and 
its very close homologs and hits to low-complexity regions of arbitrary proteins. 
There were also a few hits to fragments of apparently unrelated proteins which
were not low-complexity.

7.4.1 Hits to close homologs 

Most commonly found hits, apart from the low-complexity fragments, were to the 
instances of the same protein in a variety of species and to its close homologs. 
The hits were concentrated in the regions where sufficiently many strongly con- 
served examples existed. In histidine kinases, the hits are found in the histidine ki- 
nase domain, more specifically, according to InterPro, in the His Kinase A (phos- 
phoacceptor) subdomain (230-257) and the ATPase domain (373-398, 400-425). 
PFMFind identified DNA gyrase (a bacterial DNA repair enzyme) as being asso- 
ciated with the (373-398) region, which is also confirmed by InterPro. Hence, in 
the histidine kinase example, PFMFind retrieved strongly conserved, functionally 
important regions, agreeing with the established methods. 

In the case of β-casein, PFMFind identified a single region corresponding to
the signal peptide whose role is to target the protein to a particular cellular
compartment or, as in this case, to be secreted. The hits were to signal sequences
of other caseins and other secreted proteins (amelogenin, having a role in
biomineralisation of teeth, and vitellogenin, a major yolk protein). No hits were
found in the mature protein segment (mature protein is the precursor from which
the signal peptide and potentially other parts have been cleaved), mainly because
the initial hits were only to the other β-casein instances of which there were not
sufficiently many to proceed to the next iteration. Apart from these, there were
also hits to low complexity and transmembrane regions of clearly unrelated proteins.

In the case of κ-casein, the majority of hits were to other κ-caseins, the
remainder being to low complexity regions. The only difference from the β-casein
case is that UniProt apparently contains more κ-casein sequences (that is, more
than the minimum number necessary to proceed to the next iteration) so that
PFMFind obtained the hits over most of the length of the protein. In the case of
β-lactoglobulin, PFMFind found hits to β-lactoglobulin itself and its close
relatives (glycodelin, a pregnancy associated protein, and other members of the
lipocalin family) as well as to some apparently unrelated proteins such as
bacterial RecA (DNA recombination enzyme) and SbnH (polyamine biosynthesis).
However, under closer scrutiny, it appears that at least the SbnH fragment has
been identified as belonging to the lipocalin domain (ProSite ll55l reference
PS00213) together with β-lactoglobulin and glycodelin. All regions in
β-lactoglobulin corresponded to identified elements of secondary structure.

Cytochromes P450 are well represented both in SwissProt and in TrEMBL,
providing a sufficient number of examples to produce good profiles. Unlike with
κ-casein, it appears that only truly conserved regions were identified. Most hits
were to the other cytochromes P450 (but not always to all members of the
superfamily - sometimes only very closely related cytochromes are retrieved) with
the exception of the regions associated with turns.

7.4.2 Low complexity regions and repeats 

Many of the significant hits retrieved by PFMFind were to low-complexity
fragments, for example consisting entirely of proline, glutamine or histidine. Such
fragments are much more common than would be expected from their amino acid
compositions, at least in eukaryotes ITtTII and frequently present problems for sim- 
ilarity searches. It is important to note that whenever low complexity regions are 
hit, the profile 'diverges' from the seed: the original sequence becomes no longer 
significant (or at least not most significant) and the profile describes a totally dif- 
ferent target. This is mainly because of compositional bias of the results where 
there are too many 'undesirable' hits which 'take over' the profile for a subsequent 
iteration. Even though the algorithm uses Dirichlet mixtures to smooth the posi- 
tional distributions, it can be swamped by the large amounts of apparently genuine 
hits. The same issue is evident where transmembrane domains, which are strongly
hydrophobic and not associated with any specific function, are hit (for example,
region 3-14 in β-casein).

The problem with low-complexity segments has been recognised and several
tools that identify and filter out such regions exist [216, 214]. In BLAST, the
default option is for all low-complexity segments to be masked prior to search.
However, some low-complexity regions may be biologically significant - for
example, some bioactive peptides could be classified as low-complexity. A different
way to avoid the effect of compositional bias is to use a Z-score statistic based
on the distribution of scores of the fragments having the same composition as a
given hit but a different order of amino acids [205]. While this approach is
commonly taken where global alignments are used, it fails to give sufficiently
many sufficiently significant fragments of short lengths (datasets are too large
and n! is too small for small n).

Hence, it appears that selective filtering of low-complexity hits is necessary.
Highly compositionally biased fragments of query sequences should be filtered
prior to search. Other fragments should be filtered at profile construction time, if
computationally feasible. The aim should be to retain as many of the results as
possible while ensuring that the profile does not diverge. One of the reasons for
the appearance of low-complexity fragments within the results is the relaxed
significance requirements for the first few iterations, but one should take care in
that respect because genuine hits also have low significance at first.
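
As an illustration of filtering prior to search, one simple option is to threshold the Shannon entropy of the residue composition of each query fragment. The sketch below is illustrative only: the function names and the 2.0-bit threshold are assumptions made for this example, not part of PFMFind, and the dedicated low-complexity tools cited above use more refined statistics.

```python
import math
from collections import Counter


def composition_entropy(fragment):
    """Shannon entropy (in bits) of the residue composition of a fragment."""
    counts = Counter(fragment)
    total = len(fragment)
    # p * log2(1/p) summed over the residues that occur in the fragment.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def is_low_complexity(fragment, threshold_bits=2.0):
    """Flag a fragment whose entropy falls below the (assumed) threshold."""
    return composition_entropy(fragment) < threshold_bits


if __name__ == "__main__":
    # Hypothetical fragments: two biased ones and two more varied ones.
    for frag in ["PPPPPPPPPPPP", "QQQQPQQQPQQQ", "MKVLILACLVAL", "RELEELNVPGEI"]:
        print(frag, round(composition_entropy(frag), 3), is_low_complexity(frag))
```

A composition-only filter of this kind cannot distinguish biologically meaningful biased fragments from uninformative ones, which is why the threshold would have to be chosen conservatively.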

The PrP searches have revealed a further weakness of the current PFMFind
algorithm and implementation. Most of the PrP hits were to the sequence itself and
its very close, almost identical homologs. While the numbers of such sequences
are not too large, the structure of the PrP itself, containing many aromatic-glycine
tandem repeats, was responsible for very large result sets: every PrP homolog
appeared several times (in a different region) as a hit for a single fragment. This
made it impossible to proceed because the current implementation of PFMFind
stores all results in main memory. The problem should be rectified by better
filtering/weighting of hits and storage of results on disk, to be retrieved as needed.

7.4.3 Issues with algorithm and implementation

A major issue that dominated all examples of PFMFind searches presented here 
was the non-homogeneity of the database. Some proteins are extremely well rep- 
resented, containing instances from a variety of species, some are very rare while 
others have multiple instances from few species. Subsection 7.4.2 discussed the
problems arising from low-complexity fragments. However, the κ-casein case has
shown that too many instances of the same protein can also present difficulties,
at least due to overfitting. Weighting of hits prior to profile construction is clearly
a solution but it is necessary to use weighting that could lower the total weight 
instead of just redistributing it. An even better approach would be to use other 
information (structure, function, domains) contained in the databases as well as 
sequence information. However, the quality of annotations varies considerably 
and this would present an implementation challenge because it would require full 
access to annotated databases by the PFMFind algorithm. 

PFMFind would also benefit from access to biological information because of 
general low significance of short fragment hits under the current statistical model. 
A Bayesian model, including the prior information available as annotation, could 
be more appropriate, provided that sufficient data is available. One must note,
however, that any increase in the complexity of the profile construction algorithm
would affect the running time. Already, except in rare cases, similarity search does
not take most of the running time of PFMFind. This can of course be attributed to
the good performance of FSIndex.

7.5 Conclusion 

The six examples have shown that PFMFind is able to identify the regions in 
the query sequence that are strongly conserved and functionally important in the 
closely related proteins as well as in some apparently unrelated proteins. The re- 
sults also indicated that some sort of filtering of low-complexity hits and repeats 
is desirable. Several improvements to the algorithms and implementation are
necessary before large-scale experiments can be conducted.

Chapter 8



Conclusions 



The motivation for this thesis comes from the biological objective of developing 
the methods for discovering the origin and function of short peptide fragments 
with conserved sequence. While most of the current approaches to protein se- 
quence analysis consider either full sequences or longer domains, short fragments 
have significant biological importance on their own. For example, there are sev- 
eral peptide fragments in various milk proteins that are cleaved during digestion 
and have possible physiological activity. Other peptides, from completely unre- 
lated organisms, may have the same activity. Hence, from a biological point of 
view, it would be very useful to have the tools to discover the relationships be- 
tween short fragments that do not necessarily extend to whole proteins. 

As in the analysis of the longer sequences, the primary technique used to relate
the short fragments is similarity search: we find fragments similar to a given query
fragment and transfer to it the function of those search results whose function is
known. The existing methods such as BLAST proved inadequate, primarily for
reasons concerning computational efficiency - they were too slow for the large
number of searches that were considered necessary. Hence the need to construct an
efficient index for similarity search in short peptide fragments that would speed up
the retrieval of queries.

Indexing a dataset in an efficient manner is only possible through a good
understanding of the geometric properties of the similarity measure on it. While
most existing indexing techniques assume that the similarity measure is given
by a metric, that is, a distance function, this is not the case for biological se- 
quences where the similarity measures are generally given by similarity scores. 
The principal reasons for using similarity scores in biology are that they have 
fewer constraints and have information-theoretic and statistical interpretations. 
For our work, as a similarity measure, we have chosen the one given by the un- 
gapped global alignment between fragments of fixed length because we believe 
that gaps do not have major importance in the context of short fragments. 

One of the important results of the thesis is the discovery that many of the 
widely used BLOSUM similarity score matrices, restricted to the standard amino 
acid alphabet, can be converted into weightable quasi-metrics (metrics without the 
symmetry axiom), which generate the same range queries as the original similarity 
scores. 
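
The conversion can be illustrated on a toy example. The sketch below assumes the construction d(a, b) = s(a, a) - s(a, b) and applies it to a made-up score matrix on a three-letter alphabet (hypothetical values, not a real BLOSUM matrix). It checks the triangle inequality, shows that d is not symmetric, and confirms that a range query by d returns the same letters as the corresponding threshold query by s.

```python
from itertools import product

# Toy similarity scores on a 3-letter alphabet (hypothetical, not BLOSUM).
ALPHABET = "ABC"
S = {
    ("A", "A"): 4, ("A", "B"): 1, ("A", "C"): 0,
    ("B", "A"): 1, ("B", "B"): 5, ("B", "C"): 2,
    ("C", "A"): 0, ("C", "B"): 2, ("C", "C"): 6,
}


def quasi_metric(a, b):
    """Assumed conversion from scores: d(a, b) = s(a, a) - s(a, b)."""
    return S[(a, a)] - S[(a, b)]


def satisfies_triangle_inequality():
    """Check d(a, c) <= d(a, b) + d(b, c) for all triples of letters."""
    return all(
        quasi_metric(a, c) <= quasi_metric(a, b) + quasi_metric(b, c)
        for a, b, c in product(ALPHABET, repeat=3)
    )


def range_query_by_distance(query, radius):
    """Letters b with d(query, b) <= radius."""
    return {b for b in ALPHABET if quasi_metric(query, b) <= radius}


def range_query_by_score(query, radius):
    """The same neighbourhood written as a score threshold."""
    return {b for b in ALPHABET if S[(query, b)] >= S[(query, query)] - radius}


if __name__ == "__main__":
    print(satisfies_triangle_inequality())
    print(quasi_metric("A", "B"), quasi_metric("B", "A"))  # asymmetric values
```

The asymmetry arises purely from the differing self-similarities s(a, a), even though the toy score matrix itself is symmetric.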

This in turn led to the following questions:

(i) What is known about the quasi-metrics and what are the principal examples? 

(ii) Can the results from asymptotic geometric analysis be extended to quasi- 
metric spaces with measure and applied to the theory of indexing for simi- 
larity search? 

(iii) Can some insights from the theory of quasi-metrics be used to build an ef- 
ficient indexing scheme for short peptide fragments that can be applied to- 
wards answering the original biological problem? 

(iv) Does the relationship between similarities and quasi-metrics on the alphabet 
extend to local (Smith-Waterman) alignments between full sequences?

Chapter 2 answers the first question above. Quasi-metrics generalise both
metrics and partial orders and are well known in topology and theoretical
computer science. The main motif that is encountered with quasi-metrics is duality:
the interplay between the quasi-metric, its conjugate and their join, the associated
metric. The novel contribution of Chapter 2 is the construction of the
universal bicomplete separable quasi-metric space V. This space is an analog of the
well-known Urysohn metric space and is universal, ultrahomogeneous and unique
up to isometry. The main motivation for constructing such a space was to provide
a previously unknown example of a quasi-metric space and to lay foundations
for future work. In particular, the universality property means that all bicomplete
separable quasi-metric spaces can be studied as subspaces of V.

The second question is considered in Chapters 4 and 5. The main object
introduced there is the pq-space: a quasi-metric space with probability measure. The
notion of concentration functions from asymptotic geometric analysis can be
defined for pq-spaces in a way that emphasises duality - instead of one concentration
function, we have two: left and right. The main theoretical result of Chapter 4 is
that a 'high-dimensional' quasi-metric space is very close to being a metric space
- in other words, that asymmetry is being lost with concentration. In the context
of the theory of similarity search, the thesis extends the theoretical framework
for indexing metric spaces to quasi-metric spaces by introducing the concept of a
quasi-metric tree. Furthermore, the developments from Chapter 4 are used to give
bounds for performance of quasi-metric indexing schemes.

Chapters 6 and 7 answer the third question. FSIndex was developed as
an indexing scheme for fragments of fixed length based on two principles:
reduction of the amino acid alphabet based on biochemical properties of amino acids
and combinatorial generation of neighbours in the space of reduced fragments. It
uses distances to reduced sequences as certification functions and thus combines
the insights from biochemistry and geometry, having significantly better
performance than existing indexing schemes (by 1-2 orders of magnitude). In addition,
FSIndex can also be used for profile-based searches and as such provides the main
component of PFMFind - a system for retrieving short conserved motifs from
protein sequences. The preliminary experimental results from Chapter 7 show that
PFMFind is very good at identifying conserved regions but has some problems
with fragments of low complexity. FSIndex also offers useful insight into the
nature of indexing in general.

The fourth question leads to what we consider as another important
contribution of this thesis to bioinformatics and computational biology: the discovery of
the relationships between local similarities and quasi-metrics in Chapter 3,
under the assumptions satisfied by the most widely used similarity score functions.
The most significant aspect of this discovery is the triangle inequality property
which could lead to novel applications to clustering and of course to indexing for
similarity search.

8.1 Directions for Future Work 

While the phenomenon of concentration of measure is well-researched for many
classical objects of mathematics, the contribution of Chapter 4 of this thesis
and the corresponding paper in Topology Proc. [181] is only the beginning. Many
non-trivial questions are opened by introducing asymmetry, that is, by replacing
a metric by a quasi-metric. For example, it would be interesting to generalise
Gromov's [79] metric between mm-spaces to mq-spaces and hence to obtain a
framework for discussing convergence to an arbitrary mq-space, where
concentration of measure is a particular case of convergence to a single point. Similarly,
one would want to find out if Vershik's [197] relationships between mm-spaces,
measures on sets of infinite matrices and Urysohn spaces can be extended to
mq-spaces. Finally, the task of constructing a universal quasi-metric space that is
not bicomplete, as well as a universal quasi-metric space complete under different
notions of completeness, remains open.

Turning to indexing schemes for similarity search, while other factors no
doubt play a significant role, the performance is principally determined by geometry.
The main task ahead is to further adapt the concepts of abstract asymptotic
geometric analysis to datasets, which are discrete but growing objects, and to develop
computational tools and techniques for predicting and improving performance.
It is clear that due to the Curse of Dimensionality, indexing 'high-dimensional'
datasets gains nothing. However, it is a common perception that, in reality, useful
datasets are never intrinsically high-dimensional. It remains a highly challenging
geometric problem to formalise this perception, first in geometric terms, and
subsequently in algorithmic terms.

Unfortunately, many indexing schemes perform badly for datasets that cannot
be said to be 'high-dimensional' - recall the performance of M-tree and mvp-tree
for datasets of protein fragments - and therefore, there is a lot of scope for
improvements to existing algorithms and data structures. Another general
observation, made apparent from experiences with FSIndex, is that additional knowledge
of domain structure could be of significant help in developing an indexing scheme.

FSIndex has shown its usability for searches of protein fragments. Another 
possible application that ought to be examined is as a subroutine of a full sequence 
search algorithm. The experiments using the preliminary versions of PFMFind 
have shown its significant potential for finding short conserved patterns in
protein sequences. It remains, however, to make further improvements in order to
eliminate problems associated with low-complexity sequences.

The relationship between similarities and quasi-metrics also opens the possi- 
bility of characterising the global geometry of DNA or protein datasets directly, 
without resorting to projections or approximations. As quasi-metrics capture 
many important properties of biological sequences, it is an opinion of the thesis 
author that asymmetry should be cherished rather than avoided by symmetrisa- 
tions. 

A general conclusion from this work is that methods based on asymmetric
distances and measures have a future in analysis of data, especially in bioinformatics
and computational biology, and those applications, in turn, can provide directions
for further mathematical research.

Appendix A
Distance Exponent



In this Appendix we outline some methods for estimating the dimensionality of
datasets based on the distance exponent of Traina, Traina and Faloutsos [188]. A
more rigorous definition of distance exponent is introduced and the methods for
estimating it are tested on some artificial datasets of known dimensions.

A.1 Basic Concepts

We give a brief introduction to the Hausdorff and Minkowski fractal dimensions.
All the definitions and results are from the book by Mattila [134] and the reader
should refer to it for more detailed treatment.

Definition A.1.1. Let $X$ be a separable metric space. The $s$-dimensional
Hausdorff measure, denoted $\mathcal{H}^s$, is defined for any set $A \subset X$ by

\[ \mathcal{H}^s(A) = \lim_{\delta \downarrow 0} \mathcal{H}^s_\delta(A), \]

where

\[ \mathcal{H}^s_\delta(A) = \inf \Big\{ \sum_i \operatorname{diam}(E_i)^s : A \subset \bigcup_i E_i,\ \operatorname{diam}(E_i) \le \delta \Big\}. \]

▲

It can be shown that $\mathcal{H}^s$ is a Borel regular measure. The measure
$\mathcal{H}^0$ corresponds to the counting measure while $\mathcal{H}^1$ has an
interpretation as a generalised length measure. In $\mathbb{R}^n$,
$\mathcal{H}^n(B_r(x)) = (2r)^n$.

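Definition A.1.2. The Hausdorff dimension of a set $A \subset X$ is

\[ \dim A = \sup\{s : \mathcal{H}^s(A) > 0\} = \sup\{s : \mathcal{H}^s(A) = \infty\}
          = \inf\{t : \mathcal{H}^t(A) < \infty\} = \inf\{t : \mathcal{H}^t(A) = 0\}. \]

▲

The Hausdorff dimension has some desirable properties for a dimension,
namely:

• $\dim A \le \dim B$ for all $A \subset B \subset X$,

• $\dim \bigcup_{i=1}^{\infty} A_i = \sup_i \dim A_i$ for $A_i \subset X$, $i = 1, 2, \ldots$, and

• $\dim \mathbb{R}^n = n$.

Hence $0 \le \dim A \le n$ for all $A \subset \mathbb{R}^n$.

Definition A.1.3. Let $A$ be a non-empty bounded subset of $\mathbb{R}^n$. For
$0 < \varepsilon < \infty$, let $N(A, \varepsilon)$ be the smallest number of
$\varepsilon$-balls needed to cover $A$:

\[ N(A, \varepsilon) = \min \Big\{ k : A \subset \bigcup_{i=1}^{k} B(x_i, \varepsilon) \text{ for some } x_i \in \mathbb{R}^n \Big\}. \]

The upper and lower Minkowski dimensions of $A$ are defined by

\[ \overline{\dim}_M A = \inf\{ s : \limsup_{\varepsilon \downarrow 0} N(A, \varepsilon)\,\varepsilon^s = 0 \} \]

and

\[ \underline{\dim}_M A = \inf\{ s : \liminf_{\varepsilon \downarrow 0} N(A, \varepsilon)\,\varepsilon^s = 0 \}. \]

▲

It follows from the definitions that
$\dim A \le \underline{\dim}_M A \le \overline{\dim}_M A \le n$ and
these inequalities can be strict. Equivalently,

\[ \overline{\dim}_M A = \limsup_{\varepsilon \downarrow 0} \frac{\log N(A, \varepsilon)}{\log(1/\varepsilon)}, \qquad
   \underline{\dim}_M A = \liminf_{\varepsilon \downarrow 0} \frac{\log N(A, \varepsilon)}{\log(1/\varepsilon)}. \]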

The following theorem provides a motivation for considering the fractal
dimension to be the exponent of the growth of the measure of a ball, at least in
$\mathbb{R}^n$.

Theorem A.1.4 ([168]). Let $A$ be a non-empty bounded subset of $\mathbb{R}^n$.
Suppose there exists a Borel measure $\mu$ on $\mathbb{R}^n$ and positive numbers
$a$, $b$, $r_0$ and $s$ such that $0 < \mu(A) \le \mu(\mathbb{R}^n) < \infty$ and

\[ 0 < a r^s \le \mu(B(x, r)) \le b r^s < \infty \]

for all $x \in A$ and $0 < r \le r_0$. Then
$\dim A = \underline{\dim}_M A = \overline{\dim}_M A = s$, where $\dim A$ is the
Hausdorff dimension and $\underline{\dim}_M A$ and $\overline{\dim}_M A$ are the
lower and upper Minkowski dimensions of $A$. □

Traina, Traina and Faloutsos [188] observed that the distributions of distances
between points of many existing datasets follow a power law for small distances 
and proposed a concept of distance exponent as an estimate of the fractal dimen- 
sion of datasets. By their definition, the distance exponent is the slope of the linear 
part of the graph of the distance distribution function on the log-log scale. How- 
ever, a more rigorous definition is necessary, because the power law is only an 
approximation and it is difficult to ascertain the exact bounds of the linear part. 
We define the distance exponent in the framework of pm-spaces. 

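Definition A.1.5. Let $(\Omega, d, \mu)$ be a pm-space. Define
$F : \mathbb{R} \to [0, 1]$, the cumulative distance distribution function of
$(\Omega, d, \mu)$, by

\[ F(r) = (\mu \otimes \mu)(\{(x, y) \in \Omega \times \Omega : d(x, y) \le r\}). \]

▲

Remark A.1.6. Clearly, $F(r)$ is the average measure of a closed ball of radius $r$.
By Fubini's Theorem,

\[ F(r) = (\mu \otimes \mu)(\{(x, y) \in \Omega \times \Omega : y \in B_r(x)\})
        = \int_{x \in \Omega} \int_{y \in B_r(x)} d\mu(y)\, d\mu(x)
        = \int_{x \in \Omega} \mu(B_r(x))\, d\mu(x), \]

where $B_r(x)$ denotes the closed ball of radius $r$ centred at $x$.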

Definition A.1.7. Let $(\Omega, d, \mu)$ be a pm-space and $F$ its cumulative
distance distribution function. The distance exponent, denoted
$D(\Omega, d, \mu)$, is defined by

\[ D(\Omega, d, \mu) = \lim_{r \downarrow 0} \frac{\log F(r)}{\log r}. \]

Note that the distance exponent need not be defined and that it makes sense
only for the case where $\Omega$ is an infinite set and $\mu$ a continuous measure.
Many existing workloads can be modelled in this way, with a domain a large infinite
space and the dataset a finite sample according to some continuous measure (see
Section 5.2).

The exact relation between the distance exponent and fractal dimensions in
general remains an open question - indeed, our definition of the Minkowski
dimension applies only for $\mathbb{R}^n$. If a set $A \subset \mathbb{R}^n$
satisfies the conditions of Theorem A.1.4 then clearly
$0 < a r^s \le F(r) \le b r^s < \infty$ for $0 < r \le r_0$ and hence the
distance exponent corresponds to the Hausdorff and Minkowski dimensions.



A.2 Theoretical Examples 

Although it is usually difficult to derive a general distribution function of distances
of points on an arbitrary manifold, it is sometimes possible to use the symmetry of
specific objects and metrics to obtain the exact forms for their cumulative distance
distribution functions.

Let $(M, \rho, P)$ be a pm-space where $M \subset \mathbb{R}^n$ and $\mu$ is the
density function of the probability measure $P$. Suppose the metric $\rho$ on $M$
is induced by the norm $\|\cdot\|$ on $\mathbb{R}^n$. Denote by $B$ the unit ball
with respect to $\|\cdot\|$ (i.e. $B = \{x \in \mathbb{R}^n : \|x\| \le 1\}$). Let
$X$ and $Y$ be random variables taking values in $M$ according to $P$. Then
the cumulative distance distribution function of $(M, \rho, P)$ is given by

\[ F(r) = \Pr(\|X - Y\| \le r) = \Pr(X - Y \in rB) = \int_{rB} f_{X-Y}(t)\, dt \tag{A.1} \]

where $f_{X-Y}$ is the density function of differences $X - Y$. The integral above
can be quite hard to evaluate in closed form but there are cases where this poses no
problem. Two such cases are provided for illustration.

A.2.1 The cube $[0, 1]^n$

Consider the pm-space $(M, \rho, \mu)$ where $M$ is the unit cube $[0, 1]^n$,
$\rho$ is the $\ell_\infty$ metric (i.e.
$\rho(x, y) = \max_{1 \le i \le n} |y_i - x_i|$) and $\mu$ is a uniform measure on
$M$. The density function $f_X$ is given by

\[ f_X(x) = \begin{cases} 1 & \text{if } x \in [0, 1]^n, \\ 0 & \text{otherwise.} \end{cases} \tag{A.2} \]

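Observe that $f_X$ is a product of uniform distributions on $[0, 1]$, that is:

\[ f_X(x) = \prod_{i=1}^{n} f_{X_i}(x_i), \qquad
   f_{X_i}(x_i) = \begin{cases} 1 & \text{if } x_i \in [0, 1], \\ 0 & \text{otherwise.} \end{cases} \tag{A.3} \]

Thus

\begin{align*}
f_{X-Y}(t) &= \prod_{i=1}^{n} f_{X_i - Y_i}(t_i) \\
           &= \prod_{i=1}^{n} (f_{X_i} * f_{-Y_i})(t_i) \\
           &= \prod_{i=1}^{n} \int_{-\infty}^{\infty} f_{X_i}(\tau)\, f_{-Y_i}(t_i - \tau)\, d\tau \\
           &= \prod_{i=1}^{n} \int_{-\infty}^{\infty} f_{X_i}(\tau)\, f_{X_i}(\tau - t_i)\, d\tau
              \quad \text{since } f_{-Y_i}(y_i) = f_{Y_i}(-y_i) = f_{X_i}(-y_i) \\
           &= \prod_{i=1}^{n} \int_{0}^{1} f_{X_i}(\tau - t_i)\, d\tau.
\end{align*}

Now if $g(u) = \int_0^1 f_{X_i}(\tau - u)\, d\tau$ then

\[ g(u) = \begin{cases} 1 + u & \text{if } u \in [-1, 0], \\ 1 - u & \text{if } u \in [0, 1], \\ 0 & \text{otherwise.} \end{cases} \]

Remember that the unit ball with respect to the $\ell_\infty$ norm is $[-1, 1]^n$
and therefore

\begin{align*}
F(r) &= \Pr(\|X - Y\|_\infty \le r) \\
     &= \Pr(X - Y \in [-r, r]^n) \\
     &= \int_{[-r, r]^n} f_{X-Y}(t)\, dt \\
     &= \prod_{i=1}^{n} \int_{-r}^{r} g(t_i)\, dt_i \\
     &= \begin{cases} \prod_{i=1}^{n} 2 \int_0^r (1 - t_i)\, dt_i & \text{if } 0 \le r \le 1, \\ 1 & \text{if } r > 1 \end{cases} \\
     &= \begin{cases} (2r - r^2)^n & \text{if } 0 \le r \le 1, \\ 1 & \text{if } r > 1. \end{cases}
\end{align*}

It therefore follows that $D(M, \rho, \mu) = n$ as expected.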
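
The closed form for the cube, $F(r) = (2r - r^2)^n$ for $0 \le r \le 1$, can be checked numerically. The following is a minimal sketch under stated assumptions: the function names, the number of sampled pairs and the seed are choices made for this illustration and are not taken from the experiments reported in this Appendix.

```python
import math
import random


def cube_distance_cdf(r, n):
    """Closed form F(r) = (2r - r^2)^n for [0, 1]^n with the l_infinity metric."""
    if r <= 0:
        return 0.0
    if r >= 1:
        return 1.0
    return (2 * r - r * r) ** n


def monte_carlo_cdf(r, n, pairs=20000, seed=0):
    """Fraction of random pairs of uniform points within l_infinity distance r."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(pairs):
        # The pair is within distance r iff every coordinate difference is <= r.
        if all(abs(rng.random() - rng.random()) <= r for _ in range(n)):
            hits += 1
    return hits / pairs


def local_exponent(r, n):
    """log F(r) / log r for 0 < r < 1, computed via n * log(2r - r^2)."""
    return n * math.log(2 * r - r * r) / math.log(r)


if __name__ == "__main__":
    print(cube_distance_cdf(0.3, 2), monte_carlo_cdf(0.3, 2))
    for r in (1e-2, 1e-10, 1e-100):
        print(r, local_exponent(r, 3))
```

Because of the constant factor $2^n$ in the leading term $(2r)^n$, the ratio $\log F(r) / \log r$ approaches $n$ only logarithmically in $r$, which is one reason why estimation at finite radii tends to work with the slope of the log-log graph rather than with this ratio.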
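A.2.2 Multivariate normal distribution

Now consider the pm-space $(M, \rho, \mu)$ where $M = \mathbb{R}^n$, $\rho$ is the
$\ell_2$ metric (i.e. $\rho(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}$) and $\mu$
is a multivariate Gaussian measure (normal distribution) on $\mathbb{R}^n$ with
mean 0 and variance 1 in all coordinate directions. The density function $f_X$ is
given by

\[ f_X(x) = \frac{1}{(2\pi)^{n/2}} \exp\Big( -\frac{\|x\|_2^2}{2} \Big). \]

Again, $f_X$ defines a product distribution as in Equation (A.3), where
$f_{X_i}(x_i) = \frac{1}{\sqrt{2\pi}} \exp\big( -\frac{x_i^2}{2} \big)$. Hence, we
can use the fact that $f_{X_i}$ is an even function and the well-known result that
the sum of two independent normal random variables is a normal random variable
whose mean is the sum of the means and whose variance is the sum of the variances
of these random variables, to conclude that

\[ f_{X-Y}(t) = \frac{1}{(2\sqrt{\pi})^n} \exp\Big( -\frac{\|t\|_2^2}{4} \Big). \]

Let $g(t) = \frac{1}{(2\sqrt{\pi})^n} \exp\big( -\frac{t^2}{4} \big)$. Using the
radial symmetry of $f_{X-Y}$ and spherical coordinates,

\begin{align*}
F(r) &= P(\|X - Y\|_2 \le r) \\
     &= P(X - Y \in rB^n) \quad (B^n \text{ is the Euclidean unit ball}) \\
     &= \int_{rB^n} f_{X-Y}(t)\, dt \\
     &= \int_{rB^n} g(\|t\|_2)\, dt \\
     &= \frac{2\pi^{n/2}}{\Gamma(n/2)} \int_0^r t^{n-1} g(t)\, dt \\
     &= \frac{2\pi^{n/2}}{\Gamma(n/2)} \int_0^r \frac{t^{n-1}}{(2\sqrt{\pi})^n} \exp\Big( -\frac{t^2}{4} \Big)\, dt \\
     &= \frac{2}{\Gamma(n/2)} \int_0^{r/2} u^{n-1} \exp(-u^2)\, du.
\end{align*}

The above expression can be evaluated as a power series. Let
$H_n(r) = \int_0^r u^{n-1} \exp(-u^2)\, du$, so that
$F(r) = \frac{2}{\Gamma(n/2)} H_n(r/2)$. Then, integrating by parts for $n \ge 3$,

\begin{align*}
H_n(r) &= \Big[ -\frac{1}{2} u^{n-2} e^{-u^2} \Big]_0^r + \frac{n-2}{2} \int_0^r u^{n-3} e^{-u^2}\, du \\
       &= -\frac{1}{2} r^{n-2} e^{-r^2} + \frac{n-2}{2} H_{n-2}(r).
\end{align*}

The above recurrence relation can be solved for even and odd $n$ separately. If $n$
is even, using $H_2(r) = \frac{1}{2}(1 - e^{-r^2})$,

\begin{align*}
H_n(r) &= \frac{(n-2)(n-4) \cdots 4 \cdot 2}{2^{n/2-1}} H_2(r)
          - \frac{1}{2} e^{-r^2} \sum_{k=1}^{n/2-1} \frac{(n/2-1)!}{k!}\, r^{2k} \\
       &= \frac{1}{2} \Big( \frac{n}{2} - 1 \Big)! \Big( 1 - e^{-r^2} \sum_{k=0}^{n/2-1} \frac{r^{2k}}{k!} \Big) \\
       &= \frac{1}{2} \Big( \frac{n}{2} - 1 \Big)!\; e^{-r^2} \sum_{k=n/2}^{\infty} \frac{r^{2k}}{k!}.
\end{align*}

If $n$ is odd, using
$H_1(r) = \frac{\Gamma(1/2)}{2} e^{-r^2} \sum_{k=1}^{\infty} \frac{r^{2k-1}}{\Gamma(k + \frac{1}{2})}$,

\begin{align*}
H_n(r) &= \frac{(n-2)(n-4) \cdots 5 \cdot 3}{2^{(n-1)/2}} H_1(r)
          - \frac{1}{2} e^{-r^2} \sum_{k=1}^{(n-1)/2} \frac{\Gamma(\frac{n}{2})}{\Gamma(k + \frac{1}{2})}\, r^{2k-1} \\
       &= \frac{1}{2} \Gamma\Big( \frac{n}{2} \Big) e^{-r^2}
          \Big( \sum_{k=1}^{\infty} \frac{r^{2k-1}}{\Gamma(k + \frac{1}{2})}
                - \sum_{k=1}^{(n-1)/2} \frac{r^{2k-1}}{\Gamma(k + \frac{1}{2})} \Big) \\
       &= \frac{1}{2} \Gamma\Big( \frac{n}{2} \Big) e^{-r^2} \sum_{k=(n+1)/2}^{\infty} \frac{r^{2k-1}}{\Gamma(k + \frac{1}{2})}.
\end{align*}

Therefore,

\[ F(r) = \begin{cases}
   e^{-r^2/4} \sum_{k=n/2}^{\infty} \dfrac{(r/2)^{2k}}{k!} & \text{if } n \text{ is even,} \\[2ex]
   e^{-r^2/4} \sum_{k=(n+1)/2}^{\infty} \dfrac{(r/2)^{2k-1}}{\Gamma(k + \frac{1}{2})} & \text{if } n \text{ is odd,}
   \end{cases} \tag{A.5} \]

and hence it is not difficult to verify that $D(M, \rho, \mu) = n$.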
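
The series for the Gaussian case can be checked numerically. Both the even and the odd case can be written as the single sum $F(r) = e^{-r^2/4} \sum_{j \ge 0} (r/2)^{n+2j} / \Gamma(n/2 + j + 1)$, and for $n = 1$ and $n = 2$ this should reduce to $\operatorname{erf}(r/2)$ and $1 - e^{-r^2/4}$ respectively. The sketch below is illustrative only: the function name, the stopping tolerance and the term limit are assumptions, and the plain summation is intended for moderate values of $r$.

```python
import math


def gaussian_distance_cdf(r, n, tol=1e-16, max_terms=10000):
    """F(r) = exp(-r^2/4) * sum_{j>=0} (r/2)^(n+2j) / Gamma(n/2 + j + 1)."""
    if r <= 0:
        return 0.0
    x = (r / 2.0) ** 2
    # First term of the sum (j = 0).
    term = (r / 2.0) ** n / math.gamma(n / 2.0 + 1.0)
    total = 0.0
    j = 0
    while j < max_terms:
        total += term
        j += 1
        # Ratio of consecutive terms: x / (n/2 + j), with j the new index.
        term *= x / (n / 2.0 + j)
        if term < tol * total:
            break
    return math.exp(-x) * total


if __name__ == "__main__":
    for r in (0.1, 1.0, 3.0):
        print(r, gaussian_distance_cdf(r, 1), math.erf(r / 2.0))
        print(r, gaussian_distance_cdf(r, 2), 1.0 - math.exp(-r * r / 4.0))
```

The leading term of the sum is $(r/2)^n / \Gamma(n/2 + 1)$, which is where the distance exponent $n$ comes from.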



A.3 Estimation From Datasets 

Two algorithms were used to estimate the distance exponent from artificially
generated datasets corresponding to geometric objects of known dimension. In each
case an estimate $\hat{F}$ of $F$ was obtained by taking a random sample
$X' \subset X \subset \mathbb{R}^n$ and calculating all distances between the
points in $X'$. Therefore,

\[ \hat{F}(r) = \mu''(\{(x, y) \in X' \times X' : d(x, y) \le r\}) \]

where $\mu''$ is the normalised counting measure on $X' \times X'$. All computation
was handled by the MATLAB package [187]. In all cases (i.e. for all dimensions)
the artificial datasets consisted of no more than 20000 points while approximately
200000 distances were sampled to obtain $\hat{F}$.
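
The sampling estimate can be sketched as follows. This is not the MATLAB code used for the experiments: the function names, the choice to sample random pairs of distinct points, and the fixed seed are assumptions made for illustration.

```python
import bisect
import math
import random


def sample_pair_distances(points, num_pairs, metric=math.dist, seed=0):
    """Sorted distances between randomly chosen pairs of distinct points."""
    rng = random.Random(seed)
    distances = []
    for _ in range(num_pairs):
        x, y = rng.sample(points, 2)
        distances.append(metric(x, y))
    distances.sort()
    return distances


def empirical_cdf(sorted_distances, r):
    """Estimate of F(r): fraction of the sampled distances that are <= r."""
    return bisect.bisect_right(sorted_distances, r) / len(sorted_distances)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical dataset: 500 uniform points in the unit square.
    points = [(rng.random(), rng.random()) for _ in range(500)]
    dists = sample_pair_distances(points, 5000)
    for r in (0.05, 0.1, 0.5, 1.0):
        print(r, empirical_cdf(dists, r))
```

Keeping the sampled distances sorted makes each evaluation of the estimate a binary search, which is convenient when the estimate is needed at many radii.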

The main algorithms tested were based on calculation of the slope of the
$\log \hat{F}(r)$ vs $\log r$ graph (original definition of Traina, Traina and
Faloutsos [188]) and the fitting of a polynomial to $\hat{F}$, both for small values
of $r$. A third method which was tried was based on estimation of derivatives but
was not successful for the objects of dimensions greater than 3.
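
The slope-based approach amounts to an ordinary least-squares fit of $\log F$ against $\log r$ over a range of small radii. The sketch below is a minimal illustration, assuming the radii and the corresponding values of the distribution function are already available; the function name and the example radii are hypothetical and the code is not the implementation used for the experiments.

```python
import math


def loglog_slope(radii, cdf_values):
    """Least-squares slope of log F against log r over the given points."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(f) for f in cdf_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    radii = [0.01, 0.02, 0.03, 0.04, 0.05]
    # Exact power law F(r) = 0.5 * r^3: the fitted slope is the exponent.
    print(loglog_slope(radii, [0.5 * r ** 3 for r in radii]))
    # Unit cube in R^3 with the l_infinity metric, F(r) = (2r - r^2)^3.
    print(loglog_slope(radii, [(2 * r - r * r) ** 3 for r in radii]))
```

For an exact power law the fit recovers the exponent, while for a distribution function that only approximately follows a power law the result depends on the chosen interval of radii.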

The following artificial datasets were used to test the estimation algorithms: 

• Euclidean spaces M" with standard multivariate normal (Gaussian) distribu- 
tions and £2 metrics; 

• Cubes [0, 1]" C with uniform distributions and £2 metrics; 

• Spheres C M" with uniform distributions and £2 and geodesic metrics; 

• Parabolic through in with £2 metrics. 

All objects were generated using the built-in MATLAB routines which provide random vectors in $\mathbb{R}^n$ according to the Gaussian or uniform distribution. These routines were used directly to generate the multivariate Gaussians and the cubes, while additional transformations needed to be applied for the spheres and parabolic troughs.

Uniform distributions on the spheres were obtained by projecting multivariate Gaussian vectors in $\mathbb{R}^n$ onto the unit sphere. We define a parabolic trough $P$ to be a surface in $\mathbb{R}^n$ which is a Cartesian product of a parabola $(x, cx^2)$, where $x \in [a, b]$, $a < 0 < b$, and an $(n-2)$-dimensional cube (Figure A.1). In order to obtain uniformly distributed points on $P$, it is sufficient to generate uniformly distributed points on the parabola and the cube separately. The uniform distribution on the parabola was obtained by parameterising the parabola by arc-length, sampling from the uniform distribution on $[0, 1]$ and mapping the sampled points to the parabola.
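
The two transformations can be sketched as follows. This is an illustration under stated assumptions rather than the original MATLAB code: the function names are hypothetical, the cube factor is taken to be the unit cube $[0, 1]^{n-2}$, $c$ is assumed non-zero, and the arc-length parameterisation is inverted by bisection, which is only one of several possible ways of mapping a uniform sample back to the parabola.

```python
import math
import random


def sphere_point(n, rng):
    """Uniform point on the unit sphere in R^n: normalise a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]


def arc_length(x, c):
    """Signed arc length of the parabola (t, c t^2) from t = 0 to t = x."""
    w = 2.0 * c * x
    return (w * math.sqrt(1.0 + w * w) + math.asinh(w)) / (4.0 * c)


def trough_point(n, a, b, c, rng):
    """Uniform point on {(x, c x^2) : a <= x <= b} x [0, 1]^(n-2)."""
    s_a, s_b = arc_length(a, c), arc_length(b, c)
    # Uniform position along the curve, measured by arc length.
    target = s_a + rng.random() * (s_b - s_a)
    lo, hi = a, b
    for _ in range(60):  # bisection: arc_length is increasing in x
        mid = 0.5 * (lo + hi)
        if arc_length(mid, c) < target:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    return [x, c * x * x] + [rng.random() for _ in range(n - 2)]


rng = random.Random(1)
print(sphere_point(5, rng))
print(trough_point(4, -1.0, 2.0, 1.0, rng))
```

Because the surface measure on the product is the product of arc length on the parabola and Lebesgue measure on the cube, sampling the two factors independently is enough.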




Figure A.1: A parabolic trough in 



A typical example of the function $F$ and its sampling approximation $\hat{F}$ is shown in Figure A.2 below.

Figure A.2: The cumulative distance distribution function $F$ and its approximation $\hat{F}$ for the nine-dimensional multivariate Gaussian distribution. Top - linear scale; bottom - log-log scale.



A.3.1 Estimation from log-log plots 

The definition of Traina, Traina and Faloutsos [188] involves estimation of the distance exponent from the slope of the 'linear part' of the log-log plot of the cumulative distance distribution function $F$. Our implementation produced a least-squares estimate of the slope of log F vs log r on a given interval $[a, b]$. The end-point of the interval was the fifth percentile (i.e. the smallest value $b$ such that $\hat{F}(b) > 0.05$), while the starting point was chosen so as to avoid the first few points corresponding to very small distances, which were found not to be good estimates of the true distance distribution function $F$ (see Figure A.2). The estimates of dimensions of some of the above-mentioned objects using this method are shown in Figure A.3.
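
A minimal sketch of such a slope estimate is given below. The function name `loglog_slope` and the parameters `upper_fraction` and `skip` are hypothetical; in particular, the number of small distances left out of the fit (`skip=5`) is an arbitrary choice made for illustration, not the value used in our experiments.

```python
import math


def loglog_slope(sorted_dists, upper_fraction=0.05, skip=5):
    """Least-squares slope of log F_hat versus log r for small r.

    sorted_dists: sampled distances in increasing order.
    upper_fraction: the fitting interval ends at this quantile of distances.
    skip: number of smallest distances left out of the fit.
    """
    m = len(sorted_dists)
    last = int(upper_fraction * m)
    xs, ys = [], []
    for i in range(skip, last):
        r = sorted_dists[i]
        if r > 0.0:
            xs.append(math.log(r))
            ys.append(math.log((i + 1) / m))  # F_hat at the (i+1)-th distance
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic check: distances placed at the exact quantiles of F(r) = r^3
# on [0, 1], so that log F_hat is exactly linear in log r with slope 3.
m = 1000
synthetic = [((i + 1) / m) ** (1.0 / 3.0) for i in range(m)]
slope = loglog_slope(synthetic)
print(slope)
```

On real samples the plot is not exactly linear, so the result depends on both end-points of the fitting interval.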






Figure A.3: Approximation of distance exponent from the slope of log F vs log r: estimated vs true dimension. Datasets: (i) multivariate Gaussian on $\mathbb{R}^n$ with $\ell_2$ distances; (ii) uniform distribution on the sphere with geodesic distances; (iii) uniform distribution on the parabolic trough with $\ell_2$ distances.



It is clear that our algorithm systematically underestimated the dimension of objects of 'true' (i.e. expected) dimension greater than 3. The distance exponent estimates for multivariate Gaussians and spheres did not differ to a significant extent, while the dimension of parabolic troughs was underestimated to a greater degree than in the other two cases.

In order to find an explanation for our results we sampled the exact values of $F$ for the multivariate Gaussian on $\mathbb{R}^n$ (Equation (A.5)) and applied our algorithm to them. The results are shown in Figure A.4.



[Plot for Figure A.4. Horizontal axis: True Dimension. Legend: using sample data; using theoretical function; true dimension.]

Figure A.4: Approximations of distance exponent for multivariate Gaussian distributions from the slope of log F vs log r using 5% of sampled points. Approximations using the exact values of F in the same interval are also shown.



It can be observed that the estimates of the distance exponent obtained using the true values of $F$ (which have no variance due to sampling) are not significantly better than those obtained using the approximation $\hat{F}$. We conclude that most of the observed error is due to bias: the log-log plot of $F$ (and therefore of $\hat{F}$) is not linear in the region used for estimation of the distance exponent. A method based on weighted least squares, giving more weight to smaller distances (or, equivalently, reduction of the interval to include very few points, equally distributed along the 'linear part'), brought some improvement up to dimension 7 at the price of instability due to variance (Figure A.5).
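
The size of this bias can be illustrated numerically. The sketch below assumes that the two points are independent standard Gaussian vectors $X, Y$ in $\mathbb{R}^n$, so that $|X - Y|^2/2$ has a chi-squared distribution with $n$ degrees of freedom and $F(r)$ is the regularised lower incomplete gamma function $P(n/2, r^2/4)$. Under that assumption the local slope of the log-log plot is $n / (1 + x/(h+1) + x^2/((h+1)(h+2)) + \ldots)$ with $x = r^2/4$ and $h = n/2$, which is strictly smaller than $n$ for every $r > 0$. The function names are hypothetical.

```python
import math


def gaussian_distance_cdf(r, n, terms=200):
    """P(|X - Y| <= r) for independent standard Gaussian X, Y in R^n.

    Uses the series e^(-x) * sum_{j>=0} x^(h+j) / Gamma(h+j+1),
    where x = r^2 / 4 and h = n / 2.
    """
    x = r * r / 4.0
    if x == 0.0:
        return 0.0
    h = n / 2.0
    term = math.exp(h * math.log(x) - x - math.lgamma(h + 1.0))
    total = 0.0
    for j in range(terms):
        total += term
        term *= x / (h + j + 1.0)
    return total


def local_slope(r, n, terms=200):
    """d log F / d log r = n / (1 + x/(h+1) + x^2/((h+1)(h+2)) + ...)."""
    x = r * r / 4.0
    h = n / 2.0
    term, total = 1.0, 0.0
    for j in range(terms):
        total += term
        term *= x / (h + j + 1.0)
    return n / total


def quantile(q, n):
    """Value of r with F(r) = q, found by bisection."""
    lo, hi = 0.0, 1.0
    while gaussian_distance_cdf(hi, n) < q:
        hi *= 2.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if gaussian_distance_cdf(mid, n) < q:
            lo = mid
        else:
            hi = mid
    return hi


# Local slope at the fifth percentile, the right end of the fitting interval.
for n in (2, 5, 10, 20):
    r = quantile(0.05, n)
    print(n, round(local_slope(r, n), 2))
```

Since the local slope decreases as $r$ grows, a least-squares slope over an interval ending at the fifth percentile lies between the value printed here and $n$.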

A.3.2 Estimation by polynomial fitting 



The second approach was based on the least squares approximation of $F$ near zero by a polynomial $Q_p(x) = \sum_{i=1}^{n} a_i x^{p+i-1}$. The estimation of the distance exponent $\mathfrak{D}$ was based on the assumption that there exists $L$ such that for $x \in [0, L]$, $F(x) \approx Q_{\mathfrak{D}}(x)$, and hence that the polynomial $Q_{\mathfrak{D}}$ would have the best fit to $F$ among all the $Q_p$'s. The polynomials were computed as follows.

Figure A.5: Approximations of distance exponent for multivariate Gaussian distributions from the slope of log F vs log r using only 15 sampled points. Approximations using the exact values of F in the same interval are also shown.

Let $u_i = \hat{F}(x_i)$ for $i = 1, 2, \ldots, m$, where $x_m = L$. Given a possible dimension $p$ and the number of terms of the polynomial $n$, we want to find $Q_p$ such that the $L_2$ norm of the difference between $Q_p$ and the sampled function $\hat{F}$ is minimal. Taking into account that $\hat{F}$ is a step function, we minimise

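$$
\begin{aligned}
\int_0^L \left(\hat{F}(x) - Q_p(x)\right)^2 dx
&= \sum_{j=1}^{m-1} \int_{x_j}^{x_{j+1}} \Big(u_j - \sum_{i=1}^{n} a_i x^{p+i-1}\Big)^2 dx + \int_0^{x_1} \Big(\sum_{i=1}^{n} a_i x^{p+i-1}\Big)^2 dx \\
&= \sum_{j=1}^{m-1} u_j^2 (x_{j+1} - x_j) - 2 \sum_{j=1}^{m-1} \sum_{i=1}^{n} a_i u_j \int_{x_j}^{x_{j+1}} x^{p+i-1}\, dx + \int_0^L \sum_{i=1}^{n} \sum_{k=1}^{n} a_i a_k x^{2p+i+k-2}\, dx \\
&= \sum_{j=1}^{m-1} u_j^2 (x_{j+1} - x_j) - 2 \sum_{j=1}^{m-1} \sum_{i=1}^{n} a_i u_j C_{ij} + \sum_{i=1}^{n} \sum_{k=1}^{n} a_i a_k D_{ik},
\end{aligned}
$$

where

$$C_{ij} = \frac{1}{p+i} \left(x_{j+1}^{p+i} - x_j^{p+i}\right), \qquad D_{ik} = \frac{L^{2p+i+k-1}}{2p+i+k-1}.$$

Differentiating with respect to each $a_i$ we get, for each $i = 1, 2, \ldots, n$,

$$\sum_{k=1}^{n} D_{ik} a_k = \sum_{j=1}^{m-1} C_{ij} u_j. \qquad \text{(A.6)}$$

Thus we have a system of linear equations $Da = b$, where $b_i = \sum_{j=1}^{m-1} u_j C_{ij}$, which can be solved numerically. For our computations only the one term polynomials were used and in that case Equation (A.6) is reduced to

$$a_1 = \frac{2p + 1}{(p + 1)\, L^{2p+1}} \sum_{j=1}^{m-1} u_j \left(x_{j+1}^{p+1} - x_j^{p+1}\right).$$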



Given the value of $L$, the estimate of the distance exponent was obtained by computing the errors for different values of $p$ and selecting the value of $p$ for which $Q_p$ produced the smallest error. For our tests only integer values of $p$ were tried, since it was known that the datasets had integer dimensions. In general, the optimal value of $p$ can be obtained by numerical optimisation. For the computations, the $\hat{F}$ data was divided into two equally sized sets: the 'training' set was used to compute the coefficients of the polynomial and the 'testing' set to compute the errors.
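
For the one-term case the whole procedure fits in a few lines. In the sketch below the coefficient of the monomial $a x^p$ is the ratio $b / D$, where $b = \sum_j u_j (x_{j+1}^{p+1} - x_j^{p+1})/(p+1)$ and $D = L^{2p+1}/(2p+1)$, and the error is the integral of the squared difference between the step function and the monomial. The function names are hypothetical, and the way the data is split (even-indexed points for training, odd-indexed points for testing) is an assumption made for illustration; the text above does not fix a particular split.

```python
def step_integrals(xs, us, p):
    """Return (b, d, s) for the step function equal to us[j] on [xs[j], xs[j+1]).

    b = sum_j us[j] * (xs[j+1]^(p+1) - xs[j]^(p+1)) / (p+1)
    d = L^(2p+1) / (2p+1), where L = xs[-1]
    s = sum_j us[j]^2 * (xs[j+1] - xs[j])
    """
    b = s = 0.0
    for j in range(len(xs) - 1):
        b += us[j] * (xs[j + 1] ** (p + 1) - xs[j] ** (p + 1)) / (p + 1)
        s += us[j] ** 2 * (xs[j + 1] - xs[j])
    d = xs[-1] ** (2 * p + 1) / (2 * p + 1)
    return b, d, s


def fit_monomial(xs, us, p):
    """Least-squares coefficient a of a * x^p on [0, xs[-1]]."""
    b, d, _ = step_integrals(xs, us, p)
    return b / d


def squared_error(xs, us, p, a):
    """Integral over [0, xs[-1]] of (step function - a * x^p)^2."""
    b, d, s = step_integrals(xs, us, p)
    return s - 2.0 * a * b + a * a * d


def estimate_exponent(xs, us, candidates):
    """Pick p whose monomial, fitted on even-indexed points, best fits the odd ones."""
    train = (xs[0::2], us[0::2])
    test = (xs[1::2], us[1::2])
    errors = {p: squared_error(*test, p, fit_monomial(*train, p)) for p in candidates}
    return min(errors, key=errors.get)


# Synthetic check: points placed at the exact quantiles of F(x) = x^3 on [0, 1].
m = 200
xs = [((i + 1) / m) ** (1.0 / 3.0) for i in range(m)]
us = [(i + 1) / m for i in range(m)]
print(estimate_exponent(xs, us, range(1, 7)))
```

Non-integer values of $p$ could be handled by passing a finer grid of candidates or by replacing the final `min` with a one-dimensional numerical optimiser.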




Figure A.6: Approximation of distance exponent by fitting monomials $ax^p$: estimated vs true dimension. Datasets: (i) uniform distribution on the cube with $\ell_2$ distances; (ii) multivariate Gaussian on $\mathbb{R}^n$ with $\ell_2$ distances; (iii) uniform distribution on the sphere with geodesic distances; (iv) uniform distribution on the sphere with $\ell_2$ distances.



The problem of choosing $L$ (that is, the number of points) was solved by considering a variety of endpoints and picking the maximal value of the estimated distance exponent among all of them. This approach was based on the observation that the value of $p$ for which $Q_p$ fits $F$ best, viewed as a function of $L$, has a maximum which is usually (for low dimensions) the true dimension. The estimated dimension drops for $L$ close to zero because few points are used and a large variance component is present, and also because the first few points of $\hat{F}$ usually overestimate $F$. On the other hand, if $L$ is large, the behaviour of $F$ is no longer dominated by $x^{\mathfrak{D}}$.

The above heuristic method gave surprisingly good results for our simple objects (Figure A.6). The approximations obtained in this way were much closer to the true dimension than those using the slope of log F vs log r.

While it was hoped that the polynomials with more than one term could be 
used, allowing us to use larger values of L, the approximations were not as accu- 
rate as those obtained by monomials and their interpretation was more difficult. 

A.4 General Observations 

It should be noted that estimation of the distance exponent appears to be an ill-posed problem because it is essentially equivalent to calculating derivatives of $F$ around zero (one can prove using l'Hôpital's rule that if the distance exponent is $k$ then the first $k - 1$ derivatives of $F$ at 0 must be 0). We encountered the trade-off between variance and bias in both proposed methods. A large interval in which $F$ is approximated by $\hat{F}$ was necessary in order to reduce the variance (since a small interval meant that fewer values of $\hat{F}$ were available), but it introduced the bias which lowered the estimate of the dimension (since the behaviour of $F$ was no longer dominated by $x^{\mathfrak{D}}$). In addition, in higher dimensions, most of the distances at which the values of $\hat{F}$ were available were concentrated very close to the median. This was another manifestation of the Curse of Dimensionality.

In our experiments, the polynomial fitting approach performed better in the higher dimensions than the estimation from log-log plots. It should be noted that all the datasets tested by Traina, Traina and Faloutsos [188] had dimension less than 7 (in some cases only estimates were available), so that the underestimation we observed was not as pronounced as in higher dimensions. Our polynomial fitting algorithm can be improved by using numerical optimisation to find the optimal values of $p$ and $L$.